Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show that this approach is unreliable: in an audit of Google Search with the before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and 41% return at least one page that directly reveals the answer. When a large language model (LLM), gpt-oss-120b, forecasts with these leaky documents, its measured accuracy is inflated (Brier score 0.108 vs. 0.242 with leak-free documents; lower is better). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata and timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.
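To make the reported Brier scores easier to interpret, the following is a minimal sketch of the standard Brier score for binary questions, the mean squared difference between forecast probabilities and realized 0/1 outcomes. The function name `brier_score` and all example numbers are illustrative assumptions, not the evaluation code or data behind the figures above.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    on binary questions yields 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts on four binary questions (illustrative values only).
outcomes = [1, 0, 1, 0]
confident_forecasts = [0.9, 0.1, 0.8, 0.2]  # e.g. a forecaster that has seen the answers
uncertain_forecasts = [0.6, 0.4, 0.5, 0.5]  # e.g. a forecaster with little information

print(brier_score(confident_forecasts, outcomes))
print(brier_score(uncertain_forecasts, outcomes))
```

Under this definition, a forecaster that always answers 0.5 scores 0.25, so a score of 0.242 sits close to that uninformative baseline, while 0.108 would indicate substantially more confident and correct forecasts.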