Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring that scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking LLMs better questions can yield better answers.