Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent their similarities. Most existing methods are based on local information, such as the training difficulty of samples, thereby overlooking global information, such as connectivity patterns. This oversight can result in suboptimal selection because global information is crucial for ensuring that the selected samples well represent the structural properties of the graph. To address this issue, we employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value. Based on the decomposition, we present $\textbf{S}$tructural-$\textbf{E}$ntropy-based sample $\textbf{S}$election ($\textbf{SES}$), a method that integrates both global and local information to select informative and representative samples. SES begins by constructing a $k$NN-graph among samples based on their similarities. It then measures sample importance by combining structural entropy (global metric) with training difficulty (local metric). Finally, SES applies importance-biased blue noise sampling to select a set of diverse and representative samples. Comprehensive experiments on three learning scenarios -- supervised learning, active learning, and continual learning -- clearly demonstrate the effectiveness of our method.
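The three-step pipeline described above (kNN-graph construction, combining a global metric with training difficulty, and importance-biased blue noise sampling) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: node degree stands in for structural entropy, a per-sample loss array stands in for training difficulty, and a greedy distance-rejection pass approximates importance-biased blue noise sampling. All function names and parameters are hypothetical.

```python
import numpy as np

def knn_graph(X, k=5):
    """Build a kNN graph; returns neighbor indices and the distance matrix."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)                            # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]                    # k nearest neighbors per node
    return nbrs, d

def select_samples(X, losses, m, k=5, alpha=0.5):
    """Select m samples by combining a global score (degree, as a stand-in for
    structural entropy) with a local score (loss, as a stand-in for training
    difficulty), then greedily enforcing pairwise spacing (blue-noise-like)."""
    nbrs, d = knn_graph(X, k)
    n = len(X)

    # Global proxy: degree in the symmetrized kNN graph.
    deg = np.zeros(n)
    for i in range(n):
        deg[nbrs[i]] += 1  # in-edges from i's neighbor list
        deg[i] += k        # out-edges of i
    global_score = deg / deg.max()

    # Local proxy: normalized training loss.
    local_score = losses / losses.max()
    importance = alpha * global_score + (1 - alpha) * local_score

    # Importance-biased selection: visit samples by decreasing importance,
    # rejecting any sample too close to one already selected.
    radius = np.percentile(d[np.isfinite(d)], 5)
    selected = []
    for i in np.argsort(-importance):
        if all(d[i, j] > radius for j in selected):
            selected.append(i)
        if len(selected) == m:
            break
    return selected
```

Raising `alpha` shifts the bias toward structurally central samples; enlarging `radius` trades selection size for diversity, which mirrors the diversity/representativeness trade-off the abstract attributes to blue noise sampling.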