Causally-Guided Diffusion for Stable Feature Selection

Feature selection is fundamental to robust data-centric AI, but most existing methods optimize predictive performance under a single data distribution. As a result, they often select spurious features that fail under distribution shift. Motivated by principles of causal invariance, we study feature selection from a stability perspective and introduce Causally-Guided Diffusion for Stable Feature Selection (CGDFS). CGDFS formalizes feature selection as approximate posterior inference over feature subsets, where the posterior mass favors subsets with low prediction error and low cross-environment variance. The framework combines three components. First, we formulate feature selection as stability-aware posterior sampling, in which causal invariance serves as a soft inductive bias rather than requiring explicit causal discovery. Second, we train a diffusion model as a learned prior over plausible continuous selection masks and pair it with a stability-aware likelihood that rewards invariance across environments. The diffusion prior captures structural dependencies among features and enables scalable exploration of the combinatorially large selection space. Third, we perform guided annealed Langevin sampling that combines the diffusion prior with the stability objective, yielding tractable, uncertainty-aware posterior inference that avoids discrete optimization and produces robust feature selections. We evaluate CGDFS on open-source real-world datasets exhibiting distribution shifts. Across both classification and regression tasks, CGDFS consistently selects more stable and transferable feature subsets, improving out-of-distribution performance and selection robustness relative to sparsity-based, tree-based, and stability-selection baselines.
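The sampling idea described above can be illustrated with a toy sketch. This is not the CGDFS implementation. The learned diffusion prior is replaced by a fixed Gaussian prior over masks, whose noise-perturbed score is known in closed form. The stability-aware likelihood is replaced by a hypothetical per-feature proxy: the cross-environment variance of a univariate slope minus its squared mean. The step size is held constant across noise levels for simplicity. All function names, parameters, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def stability_proxy_grad(envs, var_weight=1.0):
    """Gradient g of a linear proxy energy E(m) = sum_j m_j * g_j (an assumption).

    envs: list of (X, y) pairs, one per environment, with roughly centered columns.
    """
    # Univariate least-squares slope of y on each feature, per environment.
    slopes = np.array([(X * y[:, None]).mean(0) / (X ** 2).mean(0) for X, y in envs])
    relevance = slopes.mean(0) ** 2   # reward features that predict y on average
    instability = slopes.var(0)       # penalize slopes that change across environments
    return var_weight * instability - relevance

def annealed_langevin_masks(energy_grad, n_chains=500, prior_mean=0.5, prior_std=0.3,
                            sigma_max=1.0, sigma_min=0.01, n_levels=10,
                            steps_per_level=50, step=0.02, guidance=5.0, seed=0):
    """Guided Langevin updates over continuous masks, annealing the prior noise level."""
    rng = np.random.default_rng(seed)
    d = energy_grad.shape[0]
    m = rng.normal(prior_mean, 1.0, size=(n_chains, d))  # broad initialization
    for sigma in np.geomspace(sigma_max, sigma_min, n_levels):
        # Score of the Gaussian stand-in prior after adding noise of scale sigma.
        var_t = prior_std ** 2 + sigma ** 2
        for _ in range(steps_per_level):
            prior_score = -(m - prior_mean) / var_t
            drift = prior_score - guidance * energy_grad  # prior score plus guidance
            m = m + 0.5 * step * drift + np.sqrt(step) * rng.standard_normal(m.shape)
    return np.clip(m, 0.0, 1.0)  # map samples back to valid mask values

# Synthetic data: column 0 is causal, column 1 is spurious (its relation to y
# flips sign between environments), column 2 is pure noise.
rng = np.random.default_rng(0)
envs = []
for s in (1.5, -1.5):
    x_causal = rng.standard_normal(2000)
    y = x_causal + 0.5 * rng.standard_normal(2000)
    x_spurious = s * y + 0.5 * rng.standard_normal(2000)
    x_noise = rng.standard_normal(2000)
    envs.append((np.column_stack([x_causal, x_spurious, x_noise]), y))

masks = annealed_langevin_masks(stability_proxy_grad(envs))
inclusion = masks.mean(axis=0)  # per-feature mean mask value across chains
print(inclusion)
```

In this toy setup the causal column should receive the highest mean mask value and the spurious column the lowest, and the spread of `masks` across chains gives a rough per-feature uncertainty estimate. A learned score network and a model-based stability likelihood would take the place of the two stand-ins.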

