Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach

Data valuation has become central in the era of data-centric AI. By assigning a numeric value to each data point, it drives efficient training pipelines and enables objective pricing in data markets. Most existing data valuation methods estimate the effect of removing an individual data point by measuring the change in model validation performance under in-distribution (ID) settings, rather than under out-of-distribution (OOD) settings, where the data follow a different distribution. Because ID and OOD data behave differently, valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, their heavy computational cost hinders practical deployment. To address these challenges, we introduce \emph{Eigen-Value} (EV), a plug-and-play data valuation framework for OOD robustness that uses only a subset of ID data, including during validation. EV provides a new spectral approximation of domain discrepancy, defined as the gap between ID and OOD loss, using ratios of eigenvalues of the covariance matrix of the ID data. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, which alleviates the computational burden. Finally, EV plugs into existing ID loss-based methods by adding an EV term, without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
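To make the perturbation step concrete, the following is a minimal sketch and not the EV algorithm itself, since the abstract does not give the exact formulas. It uses the standard first-order result that removing a point $x_i$ (with the mean held fixed) shifts the $k$-th covariance eigenvalue by approximately $(\lambda_k - (v_k^\top x_i)^2)/(n-1)$, so no eigendecomposition has to be recomputed per point. The function names, the particular eigenvalue ratio (top eigenvalue over the trace), the `weight` parameter, and the way the term is combined with ID values are all assumptions made for illustration.

```python
import numpy as np

def loo_eigen_shift(X):
    """First-order leave-one-out shift of each covariance eigenvalue.

    Returns (lam, dlam) where lam has shape (d,) in ascending order and
    dlam[i, k] approximates the change in lam[k] when row i is removed.
    The shift of the mean caused by the removal is ignored.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / n                  # ID covariance matrix
    lam, V = np.linalg.eigh(C)         # eigenvalues ascending, eigenvectors in columns
    proj = Xc @ V                      # proj[i, k] = v_k^T x_i
    # C_{-i} - C = (C - x_i x_i^T) / (n - 1), so
    # delta lam_k ~= v_k^T (C_{-i} - C) v_k = (lam_k - proj[i, k]**2) / (n - 1)
    dlam = (lam[None, :] - proj**2) / (n - 1)
    return lam, dlam

def spectral_ratio(lam):
    # HYPOTHETICAL discrepancy proxy: share of variance in the top direction.
    # The paper's actual eigenvalue ratio is not specified in the abstract.
    return lam[..., -1] / lam.sum(axis=-1)

def discrepancy_contribution(X):
    """Proxy with point i included minus proxy with point i removed, per point."""
    lam, dlam = loo_eigen_shift(X)
    return spectral_ratio(lam) - spectral_ratio(lam + dlam)

# Illustrative usage on synthetic data (all values here are placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
id_values = rng.normal(size=200)       # stand-in for any ID loss-based valuation
weight = 1.0                           # hypothetical trade-off parameter
combined = id_values - weight * discrepancy_contribution(X)
```

In this sketch the extra term costs one eigendecomposition and one matrix product for the whole dataset, which reflects the kind of saving the perturbation argument is meant to provide over retraining or re-decomposing once per data point.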
