Selecting interpretable feature sets in underdetermined ($n \ll p$) and highly correlated regimes is a fundamental challenge in data science, particularly when analyzing physical measurements. In such settings, multiple distinct sparse subsets may explain the response equally well. Identifying these alternatives is crucial for generating domain-specific insights into the underlying mechanisms, yet conventional methods typically return a single solution, obscuring the full range of plausible explanations. We present GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational Bayesian framework designed to discover multiple, diverse sparse feature combinations simultaneously. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. Unlike sequential greedy approaches, GEMSS optimizes the entire ensemble of solutions within a single objective function via stochastic gradient descent. The method is validated on a benchmark of 128 synthetic experiments spanning classification and regression tasks. Results demonstrate that GEMSS scales to high-dimensional settings ($p = 5000$) with sample sizes as small as $n = 50$, extends to continuous targets, handles missing data natively, and is robust to class imbalance and Gaussian noise. GEMSS is available as the Python package 'gemss' on PyPI. The full GitHub repository at https://github.com/kat-er-ina/gemss/ also includes a free, easy-to-use application suitable for non-coders.
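
To illustrate the idea behind a Jaccard-based diversity penalty, the following minimal sketch computes the mean pairwise Jaccard similarity between candidate feature subsets. It is not the GEMSS implementation or the 'gemss' package API: the function names, the example subsets, and the use of discrete sets are assumptions made for illustration only, and the penalty actually optimized by stochastic gradient descent may be formulated differently.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature-index sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty subsets have zero similarity.
        return 0.0
    return len(a & b) / len(union)


def mean_pairwise_jaccard(solutions):
    """Average Jaccard similarity over all unordered pairs of solutions.

    A higher value means the solutions overlap more, so a diversity
    penalty of this kind would grow as the subsets become more alike.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example: three sparse solutions given as feature indices.
solutions = [{0, 3, 7}, {0, 4, 7}, {1, 5, 9}]
overlap = mean_pairwise_jaccard(solutions)
print(f"mean pairwise Jaccard similarity: {overlap:.3f}")
```

In this example only the first two subsets share features, so the average overlap is low. Adding such a quantity to an objective, scaled by a weight, would discourage the ensemble from collapsing onto a single feature combination.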