Validating user simulation is difficult due to the lack of established measures and benchmarks, which makes it challenging to assess whether a simulator accurately reflects real user behavior. As part of the Sim4IA Micro-Shared Task at the Sim4IA Workshop, SIGIR 2025, we present Sim4IA-Bench, a simulation benchmark suite for predicting next queries and utterances, the first of its kind in the IR community. The dataset included in the suite comprises 160 real-world search sessions from the CORE search engine. For 70 of these sessions, up to 62 simulator runs are available, divided into Task A and Task B, in which different approaches predicted users' next search queries or utterances. Sim4IA-Bench provides a basis for evaluating and comparing user simulation approaches and for developing new measures of simulator validity. Although modest in size, the suite is the first publicly available benchmark that links real search sessions with simulated next-query predictions. Beyond serving as a testbed for next-query prediction, it also enables exploratory studies of query reformulation behavior, intent drift, and interaction-aware retrieval evaluation. We also introduce a new measure for evaluating next-query predictions in this task. By making the suite publicly available, we aim to promote reproducible research and to stimulate further work on realistic and explainable user simulation for information access: https://github.com/irgroup/Sim4IA-Bench.