In evaluation campaigns, participants often explore variations of popular, state-of-the-art baselines as a low-risk strategy to achieve competitive results. While effective, this can lead to local "hill climbing" rather than more radical, innovative departures from standard methods. Moreover, if many participants build on similar baselines, the overall diversity of approaches considered may be limited. In this work, we propose a new class of IR evaluation metrics intended to promote greater diversity of approaches in evaluation campaigns. Whereas traditional IR metrics focus on user experience, our two "innovation" metrics instead reward exploration of more divergent, higher-risk strategies that find relevant documents missed by other systems. Experiments on four TREC collections show that our metrics do change system rankings by rewarding systems that find such rare, relevant documents. This result is further supported by a controlled synthetic-data experiment and a qualitative analysis. In addition, we show that our metrics achieve higher evaluation stability and discriminative power than the standard metrics we modify. To support reproducibility, we share our source code.