Validating Synthetic Usage Data in Living Lab Environments

Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can serve as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when only moderate amounts of click data are available. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We base our methodology on comparing the click model's estimates about a system ranking with a reference ranking for which the relative performance is known. Our experiments compare different click models with respect to their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are a better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the ranking of systems that are based on the same retrieval algorithm and differ only in their interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
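To make the idea of parameterizing a click model from logged sessions more concrete, the following is a minimal sketch and not the code shared with this work. It assumes one of the simplest click models, a document-based click-through rate model, and a hypothetical session format of (query, shown ranking, set of clicked documents). The function names, the smoothing priors, and the toy sessions are illustrative assumptions only.

```python
from collections import defaultdict


def fit_dctr(sessions, prior_clicks=1.0, prior_views=2.0):
    """Estimate P(click | query, doc) from logged sessions.

    Each session is a (query, ranking, clicked_docs) tuple (assumed format).
    Additive smoothing keeps estimates defined for rarely shown documents.
    """
    clicks = defaultdict(float)
    views = defaultdict(float)
    for query, ranking, clicked in sessions:
        for doc in ranking:
            views[(query, doc)] += 1
            if doc in clicked:
                clicks[(query, doc)] += 1

    def prob(query, doc):
        key = (query, doc)
        return (clicks[key] + prior_clicks) / (views[key] + prior_views)

    return prob


def expected_clicks(prob, query, ranking):
    """Score a ranking by the sum of estimated click probabilities."""
    return sum(prob(query, doc) for doc in ranking)


# Toy session log (hypothetical data for illustration only).
sessions = [
    ("q1", ["d1", "d2", "d3"], {"d1"}),
    ("q1", ["d1", "d2", "d3"], {"d1", "d2"}),
    ("q1", ["d2", "d1", "d4"], set()),
]

prob = fit_dctr(sessions)

# Rankings produced by two hypothetical systems for the same query.
score_a = expected_clicks(prob, "q1", ["d1", "d2"])
score_b = expected_clicks(prob, "q1", ["d3", "d4"])
model_prefers_a = score_a > score_b
```

If system A is known to outperform system B, for example because B is a deliberately degraded variant, a click model that yields `model_prefers_a == True` agrees with the reference ordering. Repeating this check over many queries and increasing numbers of logged sessions gives a rough picture of how much data a given click model needs before its estimates become reliable.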
