Cranfield-style retrieval evaluations become less reliable when topics have too few or too many relevant documents, or when inter-assessor agreement on relevance is low. In evaluations with human assessors, information needs are therefore often formalized as retrieval topics, which avoids an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing reliability. To study whether LLM relevance assessors benefit from formalized information needs, we use LLMs to synthetically formalize information needs into topics that follow the established structure from previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on the 2019 and 2020 editions of TREC Deep Learning and on Robust04. We find that assessors without formalization judge many more documents relevant and have lower agreement, leading to reduced reliability in retrieval evaluations. Furthermore, we show that the formalized topics improve agreement between human and LLM relevance judgments, even when the topics are not highly similar to their human counterparts. Our findings indicate that LLM relevance assessors should use formalized information needs, as is standard for human assessment, and that topics should be synthetically formalized when no human formalization exists, to improve evaluation reliability.