Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model's behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.
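The abstract does not spell out how the doctor model's token-level signals are combined with the frozen patient model at decoding time. The minimal sketch below illustrates one plausible guided-decoding step under assumptions: a hypothetical `doctor_token_rewards` vector is added to the patient's next-token logits with a weight `beta` before sampling. The function name, the `beta` parameter, and the additive combination rule are illustrative assumptions, not the paper's actual procedure.

```python
import torch

def guided_decode_step(patient_logits: torch.Tensor,
                       doctor_token_rewards: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    """Hypothetical test-time guidance step (not the paper's exact rule).

    patient_logits:       [batch, vocab] next-token logits from the frozen patient LLM
    doctor_token_rewards: [batch, vocab] per-token scores from the smaller doctor model
    beta:                 assumed guidance strength; higher values follow the doctor more
    """
    # Additively bias the frozen model's logits with the doctor's token-level signal.
    guided_logits = patient_logits + beta * doctor_token_rewards

    # Sample (rather than argmax) so the patient's generative diversity is not collapsed.
    probs = torch.softmax(guided_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```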