Personalization and contextual coherence are two essential components of effective persona-grounded dialogue systems: they drive user engagement and keep responses relevant to the conversation and consistent with the user's identity. However, recent studies indicate that open-source large language models (LLMs), despite strong general conversational abilities such as fluency and naturalness, still struggle to generate responses that are both contextually grounded and aligned with persona cues. We present PersoDPO, a scalable preference optimization framework that fine-tunes dialogue models using supervision signals derived from automatic evaluations of responses generated by both closed-source and open-source LLMs. The framework integrates evaluation metrics targeting coherence and personalization with a length-format compliance feature that promotes instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, yielding a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with PersoDPO consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
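The abstract names the supervision signals but not how they are aggregated. As a minimal sketch of how such signals could be combined into DPO preference pairs, assuming per-response coherence and personalization scores in [0, 1] and a binary length-format compliance flag, the construction might look like the following. All names, weights, and the margin heuristic here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of PersoDPO-style preference-pair construction.
# Weights, the compliance penalty, and the margin filter are assumptions
# for illustration only; the paper's exact aggregation may differ.
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    text: str
    coherence: float        # automatic coherence score in [0, 1]
    personalization: float  # automatic persona-alignment score in [0, 1]
    length_ok: bool         # length-format compliance flag

def combined_score(r: ScoredResponse,
                   w_coh: float = 0.5, w_per: float = 0.5) -> float:
    """Aggregate the automatic signals into a single scalar."""
    base = w_coh * r.coherence + w_per * r.personalization
    return base if r.length_ok else 0.0  # hard penalty for format violations

def build_preference_pair(candidates: list[ScoredResponse],
                          margin: float = 0.1):
    """Pick (chosen, rejected) only when the score gap is informative."""
    ranked = sorted(candidates, key=combined_score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if combined_score(chosen) - combined_score(rejected) < margin:
        return None  # skip near-ties to keep pairs high-quality
    return chosen.text, rejected.text
```

The margin filter reflects one plausible way to operationalize "high-quality preference pairs": pairs whose aggregate scores are nearly tied carry little learning signal for DPO and can be discarded during automatic construction.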