As large language models (LLMs) demonstrate increasingly advanced capabilities, aligning their behavior with human values and preferences becomes crucial for their wide adoption. While previous research has focused on general alignment with principles such as helpfulness, harmlessness, and honesty, the need to account for individual and diverse preferences has been largely overlooked, potentially undermining customized human experiences. To address this gap, we train LLMs that can "interact to align": we cultivate the meta-skill of implicitly inferring the unspoken personalized preferences of the current user through multi-turn conversation, and then dynamically aligning subsequent behaviors and responses with these inferred preferences. Our approach first establishes a diverse pool of 3,310 distinct user personas by creating seed examples and then expanding them through iterative self-generation and filtering. Guided by these personas, we leverage multi-LLM collaboration to build a multi-turn preference dataset containing 3K+ tree-structured multi-turn conversations. Finally, we apply supervised fine-tuning and reinforcement learning to enhance LLMs using this dataset. For evaluation, we establish the ALOE (ALign With CustOmized PrEferences) benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure customized alignment performance during conversations. Experimental results demonstrate the effectiveness of our method in enabling dynamic, personalized alignment via interaction.