We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical context. Our assessment of ten leading models across five scenarios (337 use cases each) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user desires over safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. We observe the same systematic biases in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.