When the number of data points in a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution is to select only some of the data points (subdata) for further consideration. The central question in selecting subdata of size $n$ from $N$ available data points is which $n$ points to choose. The answer depends on the objective, but for a parametric model with a focus on parameter estimation, one approach is to select the subdata that retains maximal information about the parameters. Because the problem is inherently discrete, identifying such subdata is a classical NP-hard problem. Building on optimal approximate design theory, we develop a new methodology for information-based subdata selection that yields subdata approaching the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of $N$ and $n$, and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology makes it possible to assess the efficiency of subdata selected by any method, by providing tight lower and upper bounds for that efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.
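To make the selection problem concrete, the following is a minimal sketch and not the algorithm developed in this work. It assumes a simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ and the D-optimality criterion (the determinant of the information matrix $X^\top X$), both chosen here purely for illustration. The data values and the function names `d_criterion`, `best_subdata_exhaustive`, and `extremes_subdata` are hypothetical. The exhaustive search shows why the problem is combinatorial: it enumerates all $\binom{N}{n}$ subsets, which is only feasible for very small $N$.

```python
from itertools import combinations


def d_criterion(xs):
    """Determinant of X^T X for the assumed model y = b0 + b1*x + error.

    With rows (1, x_i), X^T X = [[n, sum x], [sum x, sum x^2]].
    """
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return n * s2 - s1 * s1


def best_subdata_exhaustive(xs, n):
    """Brute-force search over all size-n index subsets (tiny N only)."""
    return max(
        combinations(range(len(xs)), n),
        key=lambda idx: d_criterion([xs[i] for i in idx]),
    )


def extremes_subdata(xs, n):
    """Illustrative heuristic: keep the n//2 smallest and the rest largest x."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    low = n // 2
    return tuple(sorted(order[:low] + order[len(xs) - (n - low):]))


# Hypothetical toy data: N = 10 covariate values, subdata size n = 4.
x = [-2.1, -1.3, -0.7, -0.2, 0.0, 0.4, 0.9, 1.5, 1.8, 2.6]
n = 4

opt = best_subdata_exhaustive(x, n)
heur = extremes_subdata(x, n)
first = tuple(range(n))  # naive choice: the first n points

d_opt = d_criterion([x[i] for i in opt])
# D-efficiency relative to the best subset, with p = 2 parameters.
eff_heur = (d_criterion([x[i] for i in heur]) / d_opt) ** 0.5
eff_first = (d_criterion([x[i] for i in first]) / d_opt) ** 0.5

print("optimal indices:", opt)
print("efficiency of extremes heuristic:", round(eff_heur, 3))
print("efficiency of first-n subdata:", round(eff_first, 3))
```

In this sketch the efficiency of any subdata is measured against the exhaustively found optimum. For realistic $N$ that optimum cannot be enumerated, which is the setting in which bounds on the efficiency, as described above, become useful.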