Active Learning with Low-Rank Structure for Data Selection

In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that, given an embedding representation of the data and suitable geometric assumptions, heuristics based on $k$-center clustering can be used to perform data selection. This perspective was further explored by Axiotis et al. [ICML 2024], who proposed a data selection approach based on $k$-means clustering and sensitivity sampling. However, these methods assume that the dataset exhibits intrinsic geometric structure that can be effectively captured by clustering, whereas many modern datasets instead possess global algebraic structure that is better exploited by low-rank approximation or principal component analysis. In this paper, we introduce a new data selection framework based on low-rank approximation and residual-based sampling, formulated through the lens of row subset selection and loss-preserving coreset construction. Given an embedding representation of the data satisfying mild regularity conditions, which can be interpreted as algebraic or angular notions of Lipschitz continuity, we show that it is possible to select a weighted subset of $\tilde{O}\left(k + \frac{1}{\varepsilon^2}\right)$ data points whose average loss approximates the average loss over the full dataset to within a $(1+\varepsilon)$ relative error, up to an additive $\varepsilon \Phi_k$ term, where $\Phi_k$ denotes the optimal rank-$k$ approximation cost of the embedding matrix. We complement these theoretical guarantees with empirical evaluations, demonstrating that on a range of real-world datasets, our data selection approach achieves improved performance over prior strategies based on uniform sampling or clustering-based sensitivity sampling.
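To make the idea of residual-based sampling concrete, the following is a minimal sketch and not the algorithm analyzed in the paper. It assumes that the embedding matrix fits in memory, that $\Phi_k$ is the squared Frobenius norm of the residual after the best rank-$k$ approximation, and that sampling probabilities are an equal mixture of uniform sampling and squared residual row norms. The function name `residual_sample`, the mixing weight of 1/2, sampling with replacement, and the importance weights $1/(n m p_i)$ are all illustrative assumptions.

```python
import numpy as np

def residual_sample(X, k, m, rng=None):
    """Sample m rows of X with probability mixing uniform and rank-k residual mass.

    Hypothetical helper for illustration only; the mixture and weighting
    are assumptions, not the distribution used in the paper.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Best rank-k approximation via truncated SVD (Eckart-Young).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]

    # Squared residual norm of each row; their sum is the rank-k cost Phi_k.
    resid = np.sum((X - X_k) ** 2, axis=1)
    phi_k = resid.sum()

    # Assumed distribution: half uniform, half proportional to residual mass.
    p = np.full(n, 1.0 / n)
    if phi_k > 0:
        p = 0.5 * p + 0.5 * resid / phi_k
    p = p / p.sum()  # guard against floating-point drift

    # Sample with replacement; weights make the weighted sum of per-point
    # losses an unbiased estimate of the average loss over all n points.
    idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (n * m * p[idx])
    return idx, weights, phi_k

# Usage: select 50 weighted points from 200 synthetic 10-dimensional embeddings.
X = np.random.default_rng(0).normal(size=(200, 10))
idx, weights, phi_k = residual_sample(X, k=3, m=50, rng=1)
```

Given per-point losses `loss` of length $n$, the estimate of the average loss in this sketch is `np.sum(weights * loss[idx])`. Because every probability is at least $1/(2n)$ under the assumed mixture, no single weight exceeds $2/m$, which keeps the variance of the estimate bounded.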
