Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels, and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use an LLM's recollection of each key moment to generate questions about similar situations, the decisions the agent made, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process, yielding the first Sci-Fi-inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well aligned with human values (95.8%), in contrast to the unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Second, we find that the generated constitutions substantially increase alignment over the base model (from 79.4% to 95.8%) and remain resilient under adversarial prompting (from 23.3% to 92.3%). Additionally, we find that these constitutions are among the top performers on the ASIMOV Benchmark, which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable to real-world situations. We release SciFi-Benchmark, a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers, in addition to a smaller human-labeled evaluation set. Data is available at https://scifi-benchmark.github.io
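To make the alignment metric above concrete, here is a minimal sketch (not the authors' released evaluation code) of computing an alignment rate as the fraction of benchmark items where a model's chosen answer matches the human-majority-voted answer; the field names `model_answer` and `human_votes`, and the majority-vote rule, are assumptions for illustration.

```python
from collections import Counter

def alignment_rate(items):
    """Approximate alignment: fraction of items where the model's chosen
    answer matches the option most-voted by human raters.

    `items` is a list of dicts with hypothetical fields:
      - "model_answer": the option the model selected
      - "human_votes": list of options chosen by human raters
    """
    matches = 0
    for item in items:
        # Majority-voted human answer for this question.
        human_choice, _ = Counter(item["human_votes"]).most_common(1)[0]
        if item["model_answer"] == human_choice:
            matches += 1
    return matches / len(items) if items else 0.0

# Toy usage: two questions; the model agrees with the human majority on one.
items = [
    {"model_answer": "refuse", "human_votes": ["refuse", "refuse", "comply"]},
    {"model_answer": "comply", "human_votes": ["refuse", "refuse", "refuse"]},
]
print(f"alignment: {alignment_rate(items):.1%}")  # alignment: 50.0%
```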