Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

Data attribution and valuation are critical for understanding data-model synergy in Large Language Models (LLMs), yet existing gradient-based methods scale poorly to LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112$\times$ compared to RapidIn and scales to 32B-parameter LLMs, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE under two paradigms: (1) retrospective attribution, which retrieves influential training examples for specific predictions, and (2) prospective valuation, which scores the utility of candidate data zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
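
The outer-product structure mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain softmax readout with cross-entropy loss, for which the gradient with respect to the output weight matrix is the outer product of the prediction residual and the final hidden state, so gradient inner products factor into two small dot products that can each be compressed with a CountSketch. The names `readout_factors` and `make_countsketch`, the dimensions, and the sketch widths are hypothetical, and the code is not RISE's actual RH/GH channel construction, which the abstract does not specify.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def readout_factors(W, h, target):
    """Return (r, h) such that dL/dW = outer(r, h) for cross-entropy on logits W @ h."""
    r = softmax(W @ h)
    r[target] -= 1.0             # residual: predicted probabilities minus one-hot label
    return r, h

def make_countsketch(dim, width, rng):
    """Build a CountSketch map R^dim -> R^width with a fixed random hash and signs."""
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice([-1.0, 1.0], size=dim)
    def sketch(x):
        out = np.zeros(width)
        np.add.at(out, buckets, signs * x)   # signed accumulation into hashed buckets
        return out
    return sketch

rng = np.random.default_rng(0)
V, d = 1000, 64                  # toy vocabulary size and hidden width (assumed)
W = rng.normal(size=(V, d)) / np.sqrt(d)
h_a, h_b = rng.normal(size=d), rng.normal(size=d)
r_a, _ = readout_factors(W, h_a, target=3)
r_b, _ = readout_factors(W, h_b, target=3)

# Exact gradient inner product, computed two ways.
exact = np.sum(np.outer(r_a, h_a) * np.outer(r_b, h_b))
factored = (r_a @ r_b) * (h_a @ h_b)

# Sketched estimate: compress each factor independently, then multiply.
sk_r = make_countsketch(V, 256, rng)
sk_h = make_countsketch(d, 32, rng)
approx = (sk_r(r_a) @ sk_r(r_b)) * (sk_h(h_a) @ sk_h(h_b))

print(f"exact={exact:.4f} factored={factored:.4f} sketched={approx:.4f}")
```

In this toy setting, only the two sketched vectors per example would need to be stored rather than a V-by-d gradient, which is the kind of saving the abstract attributes to sketching the readout. The sketched value is a randomized estimate, so its accuracy depends on the chosen widths.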