Many real-world machine learning applications involve a very large number of features, which raises computational and memory costs and increases the risk of overfitting. Ideally, only relevant and non-redundant features should be retained, so that the information in the original data is preserved while dimensionality stays limited. Dimensionality reduction and feature selection are common preprocessing techniques for handling high-dimensional data efficiently. Dimensionality reduction methods control the number of features in the dataset while preserving its structure and minimizing information loss. Feature selection aims to identify the features most relevant to a task and discards the less informative ones. Previous work has proposed approaches that aggregate features according to their correlation, without discarding any of them, and that preserve interpretability by aggregating with the mean. A limitation of correlation-based methods is that they assume a linear relationship between features and target. In this paper, we relax this assumption in two ways. First, we propose a bias-variance analysis for general models with additive Gaussian noise, leading to a dimensionality reduction algorithm (NonLinCFA) that aggregates non-linear transformations of features with a generic aggregation function. Second, we extend the approach by assuming that a generalized linear model governs the relationship between features and target. A deviance analysis leads to a second dimensionality reduction algorithm (GenLinCFA), applicable to a larger class of regression problems and to classification settings. Finally, we test the algorithms on synthetic and real-world datasets, on both regression and classification tasks, where they show competitive performance.
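To make the correlation-based aggregation idea from prior work concrete, the following is a minimal sketch, not the NonLinCFA or GenLinCFA algorithm. It assumes a simple greedy rule: features whose Pearson correlation with a seed feature exceeds a fixed threshold are grouped together and replaced by their mean. The function name `aggregate_correlated_features`, the threshold value, and the greedy grouping rule are all illustrative assumptions; the paper derives its aggregation criteria from bias-variance and deviance analyses instead.

```python
import numpy as np

def aggregate_correlated_features(X, threshold=0.9):
    """Greedily group columns of X by pairwise correlation and average each group.

    Hypothetical helper for illustration only; the fixed threshold and the
    greedy grouping rule are assumptions, not the paper's criterion.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)  # (n_features, n_features) Pearson matrix
    unassigned = list(range(n_features))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        # Collect the remaining features strongly correlated with the seed.
        group = [seed] + [j for j in unassigned if corr[seed, j] >= threshold]
        unassigned = [j for j in unassigned if j not in group]
        groups.append(group)
    # Each reduced feature is the mean of the original features in its group,
    # so no feature is discarded and each aggregate stays interpretable.
    X_reduced = np.column_stack([X[:, g].mean(axis=1) for g in groups])
    return X_reduced, groups

# Synthetic data: two independent signals, each observed twice with small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.01 * rng.normal(size=200),
])
X_reduced, groups = aggregate_correlated_features(X, threshold=0.9)
```

In this toy setting the four columns collapse into two averaged features, one per underlying signal. Because the grouping relies on Pearson correlation, it only captures linear dependence, which is the limitation the paper sets out to relax.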