Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.