We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and generalize poorly out of distribution, when the assistant is confronted with a currency (e.g., grams of medicine) that was not included in its training distribution. Additionally, we find that when language use is inconsistent with the unknown policy (e.g., an altruistic policy combined with rude language), the assistant learns the policy more slowly. Overall, our preliminary results suggest that simulation frameworks in which AI assistants must infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
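To make the setup concrete, the following is a minimal sketch of an ultimatum game in which user preferences are formalized as policies. Everything specific here is an assumption made for illustration: the `Policy` fields, the percentage values attached to the selfish and altruistic policies, and the nearest-match rule in `infer_policy`, which is only a simple stand-in for preference inference and not the assistant's actual learning procedure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: how much a player offers and what it accepts."""
    name: str
    offer_percent: int       # share of the pot offered when proposing
    min_accept_percent: int  # smallest share accepted when responding


# Illustrative values only; they are not taken from the source.
SELFISH = Policy("selfish", offer_percent=10, min_accept_percent=0)
ALTRUISTIC = Policy("altruistic", offer_percent=60, min_accept_percent=0)
FAIR_MINDED = Policy("fair-minded", offer_percent=50, min_accept_percent=30)


def play_round(proposer: Policy, responder: Policy, pot: int,
               currency: str = "dollars") -> dict:
    """Play one round: the proposer splits the pot, the responder decides.

    If the responder rejects the offer, both players receive nothing.
    """
    offer = pot * proposer.offer_percent // 100
    accepted = offer * 100 >= responder.min_accept_percent * pot
    payoffs = (pot - offer, offer) if accepted else (0, 0)
    return {"offer": offer, "accepted": accepted,
            "payoffs": payoffs, "currency": currency}


def infer_policy(observed_offers: Sequence[int], pot: int,
                 candidates: Sequence[Policy]) -> Policy:
    """Pick the candidate whose offer share is closest to the observed mean.

    A deliberately simple stand-in for inferring a user's unknown policy
    from observed interactions.
    """
    mean_percent = 100 * sum(observed_offers) / (len(observed_offers) * pot)
    return min(candidates, key=lambda p: abs(p.offer_percent - mean_percent))


if __name__ == "__main__":
    print(play_round(ALTRUISTIC, SELFISH, pot=100))
    print(play_round(SELFISH, FAIR_MINDED, pot=100, currency="grams of medicine"))
    print(infer_policy([55, 65, 60], pot=100,
                       candidates=[SELFISH, ALTRUISTIC]).name)
```

In this sketch the `currency` argument is only a label on the result, which is why swapping dollars for grams of medicine changes nothing. An assistant that learns from language has no such guarantee, and that gap is what the out-of-distribution test probes.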