We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Prior targeted data selection methods rely primarily on influence-based greedy heuristics to improve domain-specific performance. While effective on limited, unimodal target data (i.e., data following a single pattern), these methods struggle as the complexity of the target data increases. In particular, when the target distribution is multimodal, these heuristics fail to account for its multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors behind this limitation: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates a whitened feature distance that mitigates dominant feature bias and provides a more reliable measure of data influence. Building on this, TAROT uses the whitened feature distance to quantify and minimize the optimal transport distance between the selected data and the target domain. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT on multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across deep learning tasks. Code is available at https://github.com/vita-epfl/TAROT.
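To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a whitened feature distance and of an optimal transport cost computed from it. It is not taken from the TAROT code base. Several choices are assumptions made only for illustration: the features are arbitrary per-sample vectors, whitening is done ZCA-style from the covariance of a candidate pool, both point sets carry uniform weights, and the transport problem is solved approximately with entropy-regularized Sinkhorn iterations. The function names `fit_whitening`, `whitened_distances`, and `sinkhorn_cost` are hypothetical.

```python
import numpy as np


def fit_whitening(features, eps=1e-6):
    """Return (mean, W) so that (x - mean) @ W has roughly identity covariance.

    Assumption: ZCA-style whitening estimated from the given feature matrix.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale every principal direction to unit variance so that no single
    # dominant component controls the distance.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W


def whitened_distances(candidates, targets, mean, W):
    """Pairwise Euclidean distances between whitened candidates and targets."""
    a = (candidates - mean) @ W
    b = (targets - mean) @ W
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def sinkhorn_cost(cost, reg=0.1, n_iters=200):
    """Approximate OT cost between two uniformly weighted point sets.

    Assumption: entropy-regularized Sinkhorn iterations stand in for whatever
    solver the authors use. `reg` is relative to the largest cost entry.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the candidate samples
    b = np.full(m, 1.0 / m)  # uniform mass on the target samples
    scale = max(float(cost.max()), 1e-12)
    K = np.exp(-cost / (reg * scale))
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan
```

Under these assumptions, a candidate subset whose transport cost to the target samples is lower would be considered a better match to the target domain. A typical call computes `mean, W = fit_whitening(pool)`, then `cost = whitened_distances(subset, targets, mean, W)`, then `value, plan = sinkhorn_cost(cost)`.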