Projection predictive variable selection for discrete response families with finite support

Source: arXiv. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in Computational Statistics and is available online at https://doi.org/10.1007/s00180-024-01506-0

Projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach that achieves an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback-Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The slightly better performance of the augmented-data projection comes at the cost of a substantial increase in runtime. Thus, where both methods are applicable, we recommend the latent projection in the early phase of a model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection.
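The idea suggested by the name "augmented-data projection" can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single set of reference-model category probabilities (in practice the projection would be repeated per posterior draw or cluster of draws) and a softmax (nominal) candidate model, chosen here only for illustration. All names (`augment`, `weighted_nll`, `kl_from_reference`, `ref_probs`) are hypothetical. With finite support, each observation can be replicated once per response category and weighted by the reference model's probability for that category; the weighted negative log-likelihood on the augmented data then differs from the Kullback-Leibler divergence only by a constant that does not depend on the candidate model's parameters.

```python
import numpy as np

def softmax(eta):
    """Row-wise softmax, shifted for numerical stability."""
    eta = eta - eta.max(axis=1, keepdims=True)
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def augment(X, ref_probs):
    """Replicate each row of X once per response category.

    ref_probs has shape (n, C): assumed reference-model probabilities.
    Returns the augmented predictors, category labels, and weights.
    """
    n, C = ref_probs.shape
    X_aug = np.repeat(X, C, axis=0)    # x_0, ..., x_0, x_1, ..., x_1, ...
    y_aug = np.tile(np.arange(C), n)   # 0, ..., C-1, 0, ..., C-1, ...
    w_aug = ref_probs.reshape(-1)      # weight = reference probability
    return X_aug, y_aug, w_aug

def weighted_nll(B, X_aug, y_aug, w_aug):
    """Weighted negative log-likelihood of a softmax candidate model."""
    p = softmax(X_aug @ B)
    return -np.sum(w_aug * np.log(p[np.arange(y_aug.size), y_aug]))

def kl_from_reference(B, X, ref_probs):
    """Sum over observations of KL(reference || candidate)."""
    p = softmax(X @ B)
    return np.sum(ref_probs * (np.log(ref_probs) - np.log(p)))

# Toy data (purely illustrative, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
ref_probs = softmax(rng.normal(size=(5, 3)))
X_aug, y_aug, w_aug = augment(X, ref_probs)

# Entropy of the reference probabilities: the parameter-free constant.
entropy = -np.sum(ref_probs * np.log(ref_probs))

# For any two parameter matrices, the gap between the two objectives
# is the same constant, so both share the same minimizer.
B_a = rng.normal(size=(2, 3))
B_b = rng.normal(size=(2, 3))
gap_a = weighted_nll(B_a, X_aug, y_aug, w_aug) - kl_from_reference(B_a, X, ref_probs)
gap_b = weighted_nll(B_b, X_aug, y_aug, w_aug) - kl_from_reference(B_b, X, ref_probs)
```

Because the gap is constant in the parameters, minimizing the weighted negative log-likelihood on the augmented data minimizes the divergence itself, so any fitting routine that accepts observation weights could in principle be reused. The augmented data set has as many rows as observations times categories, which is consistent with the higher runtime mentioned above.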
