Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and to produce behaviors consistent with those traits. However, little is known about how these simulated personalities influence critical web search decisions, particularly relevance assessment. Even fewer studies have examined how simulated personalities affect confidence calibration, that is, the tendency toward overconfidence or underconfidence. This gap persists even though the psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study of multiple commercial and open-source LLMs prompted to simulate Big Five personality traits. We evaluated these models on three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The results show that certain personality conditions, such as low agreeableness, consistently align more closely with human labels than the unprompted condition, while low conscientiousness best balances the suppression of overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across personalities. Building on these findings, we use personality-conditioned relevance scores and confidence values as features in a random forest classifier, which surpasses the best single-personality condition on a new dataset (TREC DL 2021) even with limited training data. These results highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators.
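To make the aggregation step concrete, the sketch below shows one way personality-conditioned relevance judgments and confidence scores could be combined as features for a random forest classifier, as described above. This is a minimal illustration, not the authors' implementation: the personality condition names, the synthetic query-document data, and the train/test split sizes are all assumptions introduced for demonstration.

```python
# Minimal sketch: personality-conditioned relevance scores and confidences
# as features for a random forest relevance classifier.
# The personality list and the synthetic data below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical prompting conditions: an unprompted baseline plus Big Five poles.
conditions = [
    "unprompted",
    "low_agreeableness", "high_agreeableness",
    "low_conscientiousness", "high_conscientiousness",
    "low_extraversion", "high_extraversion",
]

n_pairs = 500  # synthetic stand-in for labeled query-document pairs

# For each pair and condition: a graded relevance judgment (0-3)
# and a self-reported confidence in [0, 1].
relevance = rng.integers(0, 4, size=(n_pairs, len(conditions)))
confidence = rng.random(size=(n_pairs, len(conditions)))
X = np.hstack([relevance, confidence])   # one feature vector per pair
y = rng.integers(0, 4, size=n_pairs)     # human relevance labels (synthetic)

# Small training split, mirroring the "limited training data" setting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.8, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Cohen's kappa vs. human labels:", cohen_kappa_score(y_test, pred))
```

With real judgments in place of the synthetic arrays, the same feature layout (one relevance score and one confidence per personality condition) lets the classifier learn which conditions are most predictive of human labels.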