We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior with standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies are not robust: they generalize poorly in an out-of-distribution setting, when the assistant is confronted with a currency (e.g., grams of medicine) that was not included in its training distribution. Additionally, when language use is inconsistent with the unknown policy (e.g., an altruistic policy combined with rude language), the assistant learns the policy more slowly. Overall, our preliminary results suggest that simulation frameworks in which AI assistants must infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.