This paper critically evaluates attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their ability to capture the complexities of human ethics and to contribute to AI safety. We highlight the tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically relevant issues that tend to be neglected in discussions about alignment and RLxF, among them the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.