Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We argue that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) generated conditioned on the user profile and the question narrative. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, yielding a policy model that no longer requires feedback at inference time. Evaluation on the LaMP-QA benchmark, which spans three diverse domains, demonstrates consistent and significant improvements over state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides a more effective signal for optimizing personalized question answering.
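To make the alternating training scheme concrete, the sketch below outlines one possible shape of the loop in Python. It is a minimal illustration only, not the paper's implementation: every helper (`policy_generate`, `policy_refine`, `feedback_generate`, `feedback_update`, `policy_sft`) is a hypothetical stand-in injected as a callable, and the round and refinement counts are assumed values.

```python
# Hypothetical sketch of the alternating NLF training loop described in the
# abstract. All helper callables and hyperparameters are assumptions, not the
# paper's actual API.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    question: str   # the user's question, including its narrative
    profile: str    # retrieved personal context for this user (RAG output)

def train_vac(
    policy_generate: Callable[[str, str], str],         # (question, profile) -> response
    policy_refine: Callable[[str, str, str], str],      # (question, response, NLF) -> revised response
    feedback_generate: Callable[[str, str, str], str],  # (profile, question, response) -> NLF
    feedback_update: Callable[[List[Tuple]], None],     # optimize the feedback model
    policy_sft: Callable[[List[Tuple]], None],          # fine-tune the policy on (input, target) pairs
    data: List[Example],
    rounds: int = 3,        # assumed number of alternation rounds
    refine_steps: int = 2,  # assumed number of NLF refinement iterations
) -> None:
    """Alternate between optimizing the feedback model and fine-tuning the
    policy on feedback-improved responses (high-level sketch only)."""
    for _ in range(rounds):
        sft_pairs: List[Tuple] = []
        feedback_pairs: List[Tuple] = []
        for ex in data:
            response = policy_generate(ex.question, ex.profile)
            # Iteratively refine the response using natural language feedback
            # conditioned on the user profile and question narrative.
            for _ in range(refine_steps):
                nlf = feedback_generate(ex.profile, ex.question, response)
                feedback_pairs.append(((ex.profile, ex.question, response), nlf))
                response = policy_refine(ex.question, response, nlf)
            # The final, improved response becomes a fine-tuning target, so the
            # policy internalizes the personalization strategy and needs no
            # feedback at inference time.
            sft_pairs.append(((ex.question, ex.profile), response))
        feedback_update(feedback_pairs)  # step 1: optimize the feedback model
        policy_sft(sft_pairs)            # step 2: fine-tune the policy model
```

The key design point the sketch tries to capture is that NLF is consumed only during training: the refinement loop produces improved targets, and after fine-tuning on them the policy is deployed without a feedback model in the loop.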