Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is how to quantify the influence of each individual training task on performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task but is computationally infeasible at scale. An alternative approach has emerged in recent literature: building surrogate models that predict a target task's performance for any subset of training tasks. Prior work focuses on linear surrogate models, which capture first-order relationships but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first present a unified task-weighting framework for analyzing task attribution methods and establish a new connection between linear surrogate models and influence functions through a second-order analysis. We then introduce kernel surrogate models, which represent second-order task interactions more effectively. To learn the kernel surrogate efficiently, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than $2\%$ relative error and without repeated retraining. Experiments across multiple domains -- including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models: they achieve a $25\%$ higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a $40\%$ improvement in demonstration selection on in-context learning and multi-objective reinforcement learning benchmarks.
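To make the linear-versus-kernel distinction concrete, the following is a minimal self-contained sketch (not the paper's actual estimator, and with a toy `target_performance` function invented purely for illustration): a synthetic target metric over subsets of four training tasks contains an XOR-type interaction between two tasks, which a linear surrogate over inclusion indicators cannot represent but a degree-2 (quadratic-kernel-equivalent) surrogate fits exactly.

```python
import itertools
import numpy as np

# Hypothetical setup: 4 training tasks. The target metric has an
# XOR-type interaction between tasks 2 and 3, which no linear model
# over 0/1 inclusion indicators can represent.
def target_performance(s):
    s = np.asarray(s, dtype=int)
    return 0.5 * s[0] + 0.3 * s[1] + float(s[2] ^ s[3])

# Enumerate all 16 subsets of the 4 tasks as 0/1 indicator vectors.
subsets = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
y = np.array([target_performance(s) for s in subsets])

# Linear surrogate: performance ~ w . s + b (first-order terms only).
X_lin = np.hstack([subsets, np.ones((len(subsets), 1))])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
err_lin = np.max(np.abs(X_lin @ w_lin - y))

# Second-order surrogate: add pairwise products s_i * s_j, which is
# enough to express the XOR term s3 + s4 - 2 * s3 * s4 exactly.
pairs = [subsets[:, i] * subsets[:, j]
         for i, j in itertools.combinations(range(4), 2)]
X_quad = np.hstack([X_lin, np.column_stack(pairs)])
w_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
err_quad = np.max(np.abs(X_quad @ w_quad - y))

print(f"linear surrogate max error:       {err_lin:.3f}")   # nonzero
print(f"second-order surrogate max error: {err_quad:.3f}")  # ~0
```

On binary indicators, the explicit pairwise-product expansion coincides with what a quadratic kernel represents implicitly, so the second fit is the kernel-surrogate idea in its simplest form: the interaction term that defeats the linear model is captured by a single second-order feature.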