Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
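To make the measurement concrete, the sketch below illustrates the general idea of comparing item functioning across groups. It is not the paper's actual IRT procedure: instead of fitting a full Item Response Theory model, it uses a crude Rasch-style approximation (logit-transformed per-item accuracy as item difficulty, which is only valid under strong assumptions such as comparable ability distributions) and scores alignment as the correlation between human and LM item difficulties. All function names and the simulated data are illustrative.

```python
import numpy as np

def item_difficulties(responses, eps=1e-6):
    """Crude Rasch-style item difficulty: negative logit of the
    per-item proportion correct (harder items -> larger values).

    responses: (n_respondents, n_items) binary correctness matrix.
    NOTE: an approximation, not a fitted IRT model.
    """
    p = responses.mean(axis=0).clip(eps, 1 - eps)
    return -np.log(p / (1 - p))

def psychometric_alignment(human, lm):
    """Illustrative alignment score: Pearson correlation between
    human and LM item-difficulty estimates on the same items."""
    d_h = item_difficulties(human)
    d_l = item_difficulties(lm)
    return float(np.corrcoef(d_h, d_l)[0, 1])

# Simulate responses from a Rasch model with shared item difficulties
# but different "population" ability distributions (hypothetical data).
rng = np.random.default_rng(0)
true_d = rng.normal(size=20)                   # latent item difficulties
abil_human = rng.normal(0.0, 1.0, size=(200, 1))
abil_lm = rng.normal(1.5, 0.3, size=(200, 1))  # LMs: more accurate, less varied
p_human = 1 / (1 + np.exp(-(abil_human - true_d)))
p_lm = 1 / (1 + np.exp(-(abil_lm - true_d)))
human = (rng.random((200, 20)) < p_human).astype(int)
lm = (rng.random((200, 20)) < p_lm).astype(int)

print(round(psychometric_alignment(human, lm), 3))
```

Note that the two simulated groups can have very different mean accuracies yet still rank items similarly, which is exactly the distinction the abstract draws between accuracy-based comparisons and distribution-level alignment.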