Projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach that achieves an outstanding trade-off between predictive performance and sparsity. Its projection problem is hard to solve in general because it is based on the Kullback-Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for the response families used in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method, called the augmented-data projection, for all response families with discrete and finite support. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The slightly better performance of the augmented-data projection comes at the cost of a substantial increase in runtime. Thus, in such cases, we recommend the latent projection in the early phase of a model-building workflow and the augmented-data projection for final results. Both projection methods support the ordinal response family from our simulation study, but we also include a real-world cancer subtyping example with a nominal response family, a case that the latent projection does not support.
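
To make the role of the Kullback-Leibler divergence concrete, the following minimal sketch shows why a response with discrete and finite support allows the divergence to be evaluated exactly as a finite sum over categories. It is an illustration under stated assumptions, not the augmented-data projection itself: the function names, the toy reference probabilities, and the choice of an intercept-only candidate model (for which the minimizer is known in closed form as the average of the reference probabilities) are all hypothetical and do not come from the paper.

```python
import numpy as np


def kl_to_submodel(p_ref, q_sub):
    """Mean KL divergence (over observations) from reference-model category
    probabilities p_ref to candidate-model category probabilities q_sub.

    p_ref has shape (n_obs, n_cat); q_sub has shape (n_cat,) or (n_obs, n_cat).
    Because the support is finite, the divergence is an exact finite sum.
    """
    p_ref = np.asarray(p_ref, dtype=float)
    q_sub = np.asarray(q_sub, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with p_ref == 0 contribute nothing to the sum.
        terms = np.where(p_ref > 0, p_ref * (np.log(p_ref) - np.log(q_sub)), 0.0)
    return terms.sum(axis=1).mean()


def project_intercept_only(p_ref):
    """KL-minimizing category probabilities for an intercept-only candidate
    model (hypothetical toy case): the column means of p_ref."""
    return np.asarray(p_ref, dtype=float).mean(axis=0)


# Toy reference-model predictive probabilities: 4 observations, 3 categories.
# These numbers are made up for illustration.
p_ref = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

q_star = project_intercept_only(p_ref)
kl_star = kl_to_submodel(p_ref, q_star)

# Compare against randomly drawn alternative probability vectors.
rng = np.random.default_rng(0)
alternatives = rng.dirichlet(np.ones(3), size=200)
kl_alternatives = np.array([kl_to_submodel(p_ref, q) for q in alternatives])
```

In this toy case, none of the random alternatives should attain a smaller divergence than `q_star`. For candidate models with predictors, the minimization generally has no closed form and has to be solved numerically.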