Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static, predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In this work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack, from October 2024 and October 2025, used to answer questions about LangChain. Our analysis shows that all but one of the queries posed in 2024 remain fully supported by the 2025 corpus, even as relevant documents "migrate" from LangChain to competitor repositories such as LlamaIndex. Next, we compare retrieval model effectiveness on both snapshots and observe only minor shifts in model rankings, with strong overall correlation of up to 0.978 Kendall $\tau$ at Recall@50. These results suggest that retrieval benchmarks re-judged against evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh-stack/driftbench.
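The ranking-stability claim above rests on Kendall $\tau$, the rank correlation between model orderings on the two snapshots. As a minimal sketch of how such a value is computed, the following implements tie-free Kendall $\tau$ (tau-a) over two lists of per-model scores; the Recall@50 numbers are purely illustrative, not the paper's results.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a, no tie handling) between two
    equal-length score lists: (concordant - discordant) / total pairs."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1  # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1  # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical Recall@50 scores for five retrieval models on the
# 2024 and 2025 corpus snapshots (illustrative numbers only).
recall_2024 = [0.62, 0.58, 0.55, 0.49, 0.41]
recall_2025 = [0.59, 0.60, 0.52, 0.47, 0.40]  # top two models swap rank
print(kendall_tau(recall_2024, recall_2025))  # → 0.8 (9 concordant, 1 discordant of 10)
```

A $\tau$ near 1, such as the 0.978 reported above, indicates that almost every pair of models keeps its relative order across the two snapshots.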