Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback

Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years through integration into product interfaces like ChatGPT and search engines like Bing. This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. While alignment techniques like reinforcement learning with human feedback (RLHF) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. Different people may legitimately disagree on their preferences for language and conversational norms, as well as on the values or ideologies that guide their communication. Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds of a societally acceptable and safe degree of personalisation. In this paper, we ask how, and in what ways, LLMs should be personalised. First, we review literature on current paradigms for aligning LLMs with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation of whom models are really being aligned to. Second, we present a taxonomy of benefits and risks associated with personalised LLMs, for individuals and society at large. Finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable LLM behaviours within (supra-)national and organisational bounds.
