Feature selection is fundamental to robust data-centric AI, but most existing methods optimize predictive performance under a single data distribution. As a result, they often select spurious features that fail under distribution shift. Motivated by principles of causal invariance, we study feature selection from a stability perspective and introduce Causally-Guided Diffusion for Stable Feature Selection (CGDFS). CGDFS formalizes feature selection as approximate posterior inference over feature subsets, where the posterior mass favors subsets with low prediction error and low cross-environment variance. Our framework combines three key ideas. First, we formulate feature selection as stability-aware posterior sampling, in which causal invariance serves as a soft inductive bias rather than a target of explicit causal discovery. Second, we train a diffusion model as a learned prior over plausible continuous selection masks and pair it with a stability-aware likelihood that rewards invariance across environments. This diffusion prior captures structural dependencies among features and enables scalable exploration of the combinatorially large selection space. Third, we perform guided annealed Langevin sampling that combines the diffusion prior with the stability objective, yielding tractable, uncertainty-aware posterior inference that avoids discrete optimization and produces robust feature selections. We evaluate CGDFS on open-source real-world datasets that exhibit distribution shifts. Across both classification and regression tasks, CGDFS consistently selects more stable and transferable feature subsets than sparsity-based, tree-based, and stability-selection baselines, leading to improved out-of-distribution performance and greater selection robustness.
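To make the sampling idea concrete, the following is a minimal toy sketch of guided annealed Langevin sampling over continuous selection masks. It is not the CGDFS implementation, and every name and setting in it is an assumption made for illustration. An analytic Gaussian score (`prior_score`) stands in for the trained diffusion prior. The stability objective is taken to be the mean per-environment squared error plus `lam` times its variance across environments, computed with a fixed pooled least-squares weight vector `w`. The synthetic data, step sizes, and noise schedule are likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_environments(rng, n=200, coefs=(2.0, 0.5, -1.0)):
    """Toy data: features 0 and 1 drive y in every environment; feature 2 is spurious."""
    envs = []
    for c in coefs:
        x_causal = rng.normal(size=(n, 2))
        y = x_causal.sum(axis=1) + 0.1 * rng.normal(size=n)
        x_spur = c * y + rng.normal(size=n)  # its relation to y changes per environment
        envs.append((np.column_stack([x_causal, x_spur]), y))
    return envs

def energy_and_grad(z, envs, w, lam):
    """Assumed stability objective: mean risk + lam * cross-environment risk variance.

    Returns the energy and its gradient with respect to the mask logits z.
    """
    s = sigmoid(z)                      # soft selection mask in (0, 1)
    risks, grads = [], []
    for X, y in envs:
        A = X * w                       # column j = feature j times its fixed weight
        r = y - A @ s                   # residual of the masked linear predictor
        risks.append(np.mean(r ** 2))
        grads.append(-2.0 * A.T @ r / len(y))
    risks, grads = np.array(risks), np.array(grads)
    centered = risks - risks.mean()
    energy = risks.mean() + lam * np.mean(centered ** 2)
    grad_s = grads.mean(axis=0) + 2.0 * lam * (centered[:, None] * grads).mean(axis=0)
    return energy, grad_s * s * (1.0 - s)   # chain rule through the sigmoid

def prior_score(z, sigma, mu=-1.0):
    """Score of N(mu, I) smoothed with N(0, sigma^2 I).

    Placeholder for a learned score network; mu < 0 mildly favors sparse masks.
    """
    return -(z - mu) / (1.0 + sigma ** 2)

def guided_annealed_langevin(envs, w, rng, lam=5.0, guidance=20.0,
                             sigmas=np.geomspace(1.0, 0.05, 10),
                             steps=50, eps=2e-4):
    """Run Langevin updates at decreasing noise levels; return one soft mask."""
    z = rng.normal(size=w.shape)
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2      # step size shrinks with the noise level
        for _ in range(steps):
            _, g = energy_and_grad(z, envs, w, lam)
            drift = prior_score(z, sigma) - guidance * g   # prior score plus likelihood guidance
            z = z + 0.5 * alpha * drift + np.sqrt(alpha) * rng.normal(size=z.shape)
    return sigmoid(z)

rng = np.random.default_rng(0)
envs = make_environments(rng)
X_all = np.vstack([X for X, _ in envs])
y_all = np.concatenate([y for _, y in envs])
w = np.linalg.lstsq(X_all, y_all, rcond=None)[0]     # pooled weights, held fixed

# Independent chains give a sample of masks; their spread is a rough uncertainty estimate.
samples = np.stack([guided_annealed_langevin(envs, w, rng) for _ in range(20)])
print("mean inclusion per feature:", samples.mean(axis=0).round(2))
print("std across chains:         ", samples.std(axis=0).round(2))
```

Working in logit space and squashing with a sigmoid keeps the sampler continuous, in the spirit of avoiding discrete optimization. In a real system the placeholder score would be replaced by a trained diffusion model, and the energy by whatever stability-aware likelihood the method actually defines.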