This paper critically evaluates attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their ability to capture the complexities of human ethics and to contribute to AI safety. We highlight the tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically relevant issues that tend to be neglected in discussions about alignment and RLxF, among them the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.