We investigate feature selection for datasets that can be naturally partitioned into subgroups (e.g., by socio-demographic group or age), each with its own dominant set of features. Within this subgroup-oriented framework, we address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup because of flawed data integration, regulatory constraints, or privacy concerns. Feature selection is driven by mutual information, a popular measure of correlation, between features and a target variable. Our goal is to identify the top-K feature subsets of a fixed size that have the highest joint mutual information with the target variable. In the presence of systematic missing data, the closed form of mutual information cannot be applied directly. We argue that, in such a setting, relationships among the available mutual information values, within a subgroup or across subgroups, can help infer the missing ones. We propose a generalizable model based on a heterogeneous graph neural network that identifies interdependencies among feature-subgroup-target variable connections by modeling them as a multiplex graph and propagating information between its nodes. We address two distinct scalability challenges related to training and propose principled solutions to them. Through an extensive empirical evaluation, we demonstrate the efficacy of the proposed solutions in terms of both quality and running time.
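To make the selection objective concrete, the following is a minimal sketch of top-K subset selection by joint mutual information within a single subgroup whose features are all observed. It is not the proposed model: it assumes discrete features, a plug-in (empirical) entropy estimate, and exhaustive enumeration of subsets, and the names `entropy`, `joint_mutual_information`, and `top_k_subsets` are hypothetical. When a feature is missing for every tuple of a subgroup, the joint counts used here cannot be formed, which is the gap the inference model is meant to fill.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(samples):
    """Plug-in (empirical) Shannon entropy, in bits, of a list of hashable values."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def joint_mutual_information(rows, feature_idx, target):
    """I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y) for the features indexed by feature_idx."""
    xs = [tuple(row[i] for i in feature_idx) for row in rows]
    xy = list(zip(xs, target))
    return entropy(xs) + entropy(target) - entropy(xy)


def top_k_subsets(rows, target, subset_size, k):
    """Score every feature subset of the given size and return the k best as (score, subset)."""
    n_features = len(rows[0])
    scored = [
        (joint_mutual_information(rows, subset, target), subset)
        for subset in combinations(range(n_features), subset_size)
    ]
    # Highest score first; ties are broken by subset indices for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]


# Toy subgroup (illustrative data only): the target is the XOR of features 0 and 1,
# and feature 2 is unrelated to the target.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
target = [a ^ b for a, b, _ in rows]

# The pair (0, 1) ranks first: it is the only pair that determines the target.
print(top_k_subsets(rows, target, subset_size=2, k=1))
```

Exhaustive enumeration grows combinatorially with the number of features, so this sketch is only practical for small feature sets; it is meant to illustrate the objective, not a scalable procedure.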