High-cardinality categorical variables pose significant challenges in machine learning, particularly for computational efficiency and model interpretability. Traditional one-hot encoding often produces high-dimensional sparse feature spaces, increasing the risk of overfitting and reducing scalability. This paper introduces novel encoding techniques, including mean encoding, low-rank encoding, and multinomial logistic regression encoding, to address these challenges. These methods leverage sufficient representations to generate compact and informative embeddings of categorical data. We conduct rigorous theoretical analyses and empirical validations on diverse datasets, demonstrating significant improvements in model performance and computational efficiency over baseline methods. The proposed techniques are particularly effective in domains requiring scalable solutions for large datasets, paving the way for more robust and efficient applications in machine learning.
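To make the contrast with one-hot encoding concrete: mean (target) encoding replaces each category level with the mean of the target variable over that level, yielding a single numeric column instead of one column per level. A minimal sketch follows; the function name and the fallback-to-global-mean rule for unseen categories are illustrative assumptions, not the paper's exact implementation.

```python
from collections import defaultdict

def mean_encode(categories, targets):
    """Fit a mean encoder: map each category to the mean target value
    observed for that category. Unseen categories at transform time
    fall back to the global target mean (an illustrative choice)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: sums[c] / counts[c] for c in sums}
    return lambda c: encoding.get(c, global_mean)

# A categorical column with 3 levels and a binary target:
cats = ["a", "a", "b", "b", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
enc = mean_encode(cats, ys)
print(enc("a"))  # 0.5  (mean of [1, 0])
print(enc("b"))  # 0.666...  (mean of [1, 1, 0])
print(enc("z"))  # unseen level -> global mean 0.666...
```

Note that a K-level categorical becomes one dense column here, versus K sparse columns under one-hot encoding, which is the dimensionality reduction the abstract refers to.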