Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: a subset of training examples is scored by the information loss it induces, that is, the increase in predictive entropy at a query when the subset is removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss with a Gaussian process surrogate built from tangent features. We show that this approximation aligns with classical influence scores for single-example attribution while promoting diversity when attributing to subsets. For even larger-scale retrieval, we relax the criterion to an information-gain objective and add a variance correction, which enables scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval, and coreset selection, indicating that our method scales to modern architectures while bridging principled measures with practice.
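To make the scoring criterion concrete, the following is a minimal sketch of an entropy-increase score under a Gaussian process with a linear kernel on feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the function names `posterior_variance` and `information_loss`, the `noise` value, and the random stand-in features are all hypothetical, and the entropy is taken over the latent function value at the query. For a Gaussian predictive distribution the entropy is 0.5 log(2πe v), so the entropy increase from removing a subset reduces to half the log-ratio of the posterior variances.

```python
import numpy as np


def posterior_variance(features, query, noise=0.1):
    """Posterior variance at `query` of a GP with kernel k(a, b) = a @ b.

    `features` has shape (n, d) and stands in for tangent features of the
    training examples; `noise` is an assumed observation-noise variance.
    """
    prior_var = float(query @ query)
    if features.shape[0] == 0:
        return prior_var  # no data: fall back to the prior variance
    gram = features @ features.T + noise * np.eye(features.shape[0])
    cross = features @ query
    return prior_var - float(cross @ np.linalg.solve(gram, cross))


def information_loss(features, query, subset, noise=0.1):
    """Entropy increase at `query` when the rows in `subset` are removed.

    For Gaussians, H = 0.5 * log(2 * pi * e * v), so the difference in
    entropies is 0.5 * log(v_removed / v_full).
    """
    keep = np.setdiff1d(np.arange(features.shape[0]), subset)
    v_full = posterior_variance(features, query, noise)
    v_removed = posterior_variance(features[keep], query, noise)
    return 0.5 * np.log(v_removed / v_full)


# Hypothetical data: random vectors in place of real tangent features.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))
query = rng.normal(size=5)

single_scores = [information_loss(features, query, [i]) for i in range(20)]
pair_score = information_loss(features, query, [0, 1])
```

Because conditioning on fewer examples can only increase the posterior variance in this model, the score is non-negative and does not decrease as the removed subset grows. The exact solve costs cubic time in the number of training examples, so this form is only practical for small sets.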