Over recent decades, extensive research has sought to relax the restrictive assumptions that a Generalized Linear Model (GLM) requires in order to produce accurate and meaningful predictions. These efforts include coefficient regularization, feature selection, and the clustering of ordinal categories, among other approaches. Despite these advances, clustering nominal categories in GLMs without incurring high computational cost remains a challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step method designed to efficiently fuse nominal and ordinal categories in GLMs. R2VF first maps nominal features onto an ordinal scale via regularized regression and then applies variable fusion to the resulting ordered levels, striking a balance between model complexity and interpretability. We demonstrate the effectiveness of R2VF through comparisons with other methods, highlighting its performance in mitigating overfitting and identifying an appropriate set of covariates.
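The two steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian response with identity link, uses a ridge penalty for the ranking step, and performs the fusion step as a lasso on split-coded (cumulative dummy) ranks solved by coordinate descent. The function names `rank_levels`, `fuse_adjacent`, and `r2vf_sketch`, and the parameters `ridge` and `lam`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def rank_levels(codes, y, n_levels, ridge=1.0):
    """Step 1 (assumed form): ridge regression on one-hot levels, then rank by coefficient."""
    X = np.eye(n_levels)[codes]                      # one-hot design, shape (n, n_levels)
    yc = y - y.mean()                                # centering stands in for an intercept
    beta = np.linalg.solve(X.T @ X + ridge * np.eye(n_levels), X.T @ yc)
    order = np.argsort(beta)                         # levels sorted from lowest to highest effect
    rank_of = np.empty(n_levels, dtype=int)
    rank_of[order] = np.arange(n_levels)             # rank_of[level] = position in that ordering
    return rank_of

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def fuse_adjacent(ranks, y, n_levels, lam=0.1, n_iter=500):
    """Step 2 (assumed form): lasso on split-coded ranks.

    Column j is 1 when rank >= j, so its coefficient is the jump between
    ranks j-1 and j; a coefficient of exactly zero fuses those two ranks.
    """
    Z = (ranks[:, None] >= np.arange(1, n_levels)[None, :]).astype(float)
    Zc = Z - Z.mean(axis=0)                          # centered columns: intercept is profiled out
    r = y - y.mean()                                 # residual for the all-zero start
    n = len(y)
    col_var = (Zc ** 2).sum(axis=0) / n
    d = np.zeros(n_levels - 1)
    for _ in range(n_iter):                          # cyclic coordinate descent
        for j in range(n_levels - 1):
            if col_var[j] == 0.0:                    # constant column (unobserved extreme rank)
                continue
            r += Zc[:, j] * d[j]                     # remove coordinate j from the fit
            rho = Zc[:, j] @ r / n
            d[j] = soft_threshold(rho, lam) / col_var[j]
            r -= Zc[:, j] * d[j]                     # add the updated coordinate back
    # Group label per rank: increases by one at every nonzero jump.
    return np.concatenate([[0], np.cumsum(d != 0)])

def r2vf_sketch(codes, y, n_levels, ridge=1.0, lam=0.1):
    """Return a fused-group label for each original nominal level."""
    rank_of = rank_levels(codes, y, n_levels, ridge)
    group_of_rank = fuse_adjacent(rank_of[codes], y, n_levels, lam)
    return group_of_rank[rank_of]

# Usage on synthetic data: six nominal levels that truly form two clusters.
rng = np.random.default_rng(0)
effects = np.array([2.0, 0.0, 2.0, 0.0, 0.0, 2.0])
codes = np.tile(np.arange(6), 100)
y = effects[codes] + 0.1 * rng.standard_normal(codes.size)
groups = r2vf_sketch(codes, y, n_levels=6)
print(groups)  # levels sharing a label have been fused
```

Ranking first reduces the fusion problem from penalizing all pairwise differences between levels to penalizing only the differences between neighbors in the ranked order, which is what keeps the second step cheap. For a non-Gaussian GLM, both steps would need the corresponding penalized likelihood in place of the squared-error loss used here.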