To evaluate whether LLMs can accurately predict future events, we need the ability to \textit{backtest} them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this \emph{temporal knowledge leakage}. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies \textit{Shapley values} to measure each claim's contribution to the prediction. This yields the \textbf{Shapley}-weighted \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate (\textbf{Shapley-DCLR}), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose \textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims (\textbf{TimeSPEC}), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination -- producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
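The Shapley-weighted leakage rate described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: it assumes exact Shapley enumeration over a small set of atomic claims, a caller-supplied value function $v(S)$ (e.g., the model's prediction confidence when only the claims in $S$ are available), and absolute-value normalization for the leakage rate -- all of which are hypothetical choices for this sketch, since the abstract does not fix them. The names `shapley_values` and `shapley_dclr` are likewise illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, v):
    """Exact Shapley value of each claim's contribution to the prediction.

    claims: list of claim identifiers.
    v: value function mapping a frozenset of claims to a float
       (e.g., prediction confidence given only those claims).
    """
    n = len(claims)
    phi = {}
    for i in claims:
        rest = [c for c in claims if c != i]
        total = 0.0
        for r in range(n):
            for S in combinations(rest, r):
                # Standard Shapley weight for a coalition of size r.
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi

def shapley_dclr(phi, leaked):
    """Fraction of absolute decision-driving contribution from leaked claims."""
    denom = sum(abs(p) for p in phi.values())
    if denom == 0.0:
        return 0.0
    return sum(abs(phi[c]) for c in leaked) / denom
```

As a sanity check, for an additive value function (each claim contributes a fixed weight independently) the Shapley value of a claim equals its weight, so a claim carrying half the total weight that is flagged as leaked yields a Shapley-DCLR of 0.5. Exact enumeration is exponential in the number of claims; in practice, sampling-based Shapley approximations would be needed for long rationales.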