High-dimensional biomedical studies require models that are simultaneously accurate, sparse, and interpretable, yet exact best subset selection for generalized linear models is computationally intractable. We develop a scalable method that combines a continuous Boolean relaxation of the subset problem with a Frank--Wolfe algorithm driven by envelope gradients. The resulting method, which we call COMBSS-GLM, is simple to implement, requires one penalized generalized linear model fit per iteration, and produces sparse models along a model-size path. Theoretically, we identify a curvature-based parameter regime in which the relaxed objective is concave in the selection weights, implying that global minimizers occur at binary corners. Empirically, in logistic and multinomial simulations across low- and high-dimensional correlated settings, the proposed method consistently improves variable-selection quality relative to established penalized likelihood competitors while maintaining strong predictive performance. In biomedical applications, it recovers established loci in a binary-outcome rice genome-wide association study and achieves perfect multiclass test accuracy on the Khan SRBCT cancer dataset using a small subset of genes. Open-source implementations are available in R at https://github.com/benoit-liquet/COMBSS-GLM-R and in Python at https://github.com/saratmoka/COMBSS-GLM-Python.