Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor. This implies a potentially significant gap between the optimum of widely used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure between the data points is available or can be constructed. We show that, with an appropriate modeling approach, the presence of such a structure can turn a hard RS for Learning problem into one that can be solved effectively. To this end, we develop RS-GNN, a representation learning-based RS model built on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN both on problems with predefined graph structures and on problems with graphs induced from node feature similarities, showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.
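To make the graph-induced setting concrete, the following is a minimal sketch of the general pipeline shape the abstract describes: induce a graph from node feature similarities, compute node embeddings that use that graph, and pick a small set of representatives. It is not RS-GNN. As stand-ins, it assumes a cosine-similarity k-nearest-neighbor graph, a single step of neighbor averaging in place of a learned GNN encoder, and greedy k-center selection in place of the paper's selection objective. All function names (`knn_graph`, `smooth`, `select_representatives`) and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def knn_graph(features, num_neighbors):
    """Assumed graph construction: symmetric kNN graph over cosine similarity."""
    n = len(features)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(features[i], features[j]),
            reverse=True,
        )
        for j in ranked[:num_neighbors]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # symmetrize the edge
    return neighbors


def smooth(features, neighbors):
    """Stand-in for a GNN encoder: average each node with its neighbors."""
    smoothed = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


def select_representatives(embeddings, k):
    """Stand-in selector: greedy k-center (farthest-point) selection."""
    n = len(embeddings)
    mean = [sum(col) / n for col in zip(*embeddings)]
    # Start from the node closest to the overall mean embedding.
    first = min(range(n), key=lambda i: math.dist(embeddings[i], mean))
    chosen = [first]
    dist = [math.dist(e, embeddings[first]) for e in embeddings]
    while len(chosen) < min(k, n):
        # Add the node farthest from every representative chosen so far.
        nxt = max(range(n), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(e, embeddings[nxt]))
                for d, e in zip(dist, embeddings)]
    return chosen


# Toy data: three tight groups of 2-D feature vectors, five nodes each.
features = (
    [[10 + 0.1 * i, 0.1 * i] for i in range(5)]
    + [[0.1 * i, 10 + 0.1 * i] for i in range(5)]
    + [[-10 - 0.1 * i, -10 + 0.1 * i] for i in range(5)]
)
graph = knn_graph(features, num_neighbors=2)
representatives = select_representatives(smooth(features, graph), k=3)
```

On this toy input the three groups are far apart relative to their spread, so the sketch selects one node from each group. In the setting studied here, the averaging step would be replaced by a trained GNN encoder, and the selector by one designed for downstream model accuracy.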