The goal of automated feature generation is to free machine learning experts from the laborious task of manual feature generation, which is crucial for improving learning performance on tabular data. The major challenge in automated feature generation is to efficiently and accurately identify effective features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides results competitive with those of machine learning experts. OpenFE achieves high efficiency and accuracy with two components: 1) a novel feature boosting method for accurately evaluating the incremental performance of candidate features and 2) a two-stage pruning algorithm that performs feature pruning in a coarse-to-fine manner. Extensive experiments on ten benchmark datasets show that OpenFE outperforms existing baseline methods by a large margin. We further evaluate OpenFE in two Kaggle competitions with thousands of data science teams participating. In the two competitions, features generated by OpenFE with a simple baseline model beat 99.3% and 99.6% of the data science teams, respectively. In addition to the empirical results, we provide a theoretical perspective to show that feature generation can be beneficial in a simple yet representative setting. The code is available at https://github.com/ZhangTP1996/OpenFE.
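To make the two components concrete, the following is a minimal sketch written for illustration only; it is not the OpenFE implementation and does not use the OpenFE API. It assumes a regression task, a constant base prediction standing in for the predictions of a model trained on the original features, and a single binned step on each candidate feature in place of a full boosted model. The names `incremental_gain` and `two_stage_prune`, the bin count, and the row counts are all hypothetical choices.

```python
import random
from statistics import fmean


def mse(y, pred):
    """Mean squared error between two equal-length sequences."""
    return fmean((a - b) ** 2 for a, b in zip(y, pred))


def incremental_gain(feature, y, base_pred, n_bins=16):
    """Validation-loss reduction from one boosting-style step on `feature`.

    The step starts from `base_pred` (assumed to come from a base model),
    fits per-bin residual means on the first half of the rows, and measures
    the MSE reduction on the second half.
    """
    half = len(y) // 2
    residual = [a - b for a, b in zip(y, base_pred)]
    # Quantile bin edges taken from the training half only.
    train_sorted = sorted(feature[:half])
    edges = [train_sorted[(i * half) // n_bins] for i in range(1, n_bins)]

    def bin_of(v):
        return sum(v >= e for e in edges)

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for v, r in zip(feature[:half], residual[:half]):
        b = bin_of(v)
        sums[b] += r
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    val_y, val_base = y[half:], base_pred[half:]
    val_new = [p + means[bin_of(v)] for p, v in zip(val_base, feature[half:])]
    return mse(val_y, val_base) - mse(val_y, val_new)


def two_stage_prune(candidates, y, base_pred, coarse_rows=400, keep=2):
    """Coarse-to-fine selection (a simplification, assumed for illustration)."""
    # Stage 1 (coarse): cheap scores on a small block of rows, keep the better half.
    coarse = {
        name: incremental_gain(col[:coarse_rows], y[:coarse_rows], base_pred[:coarse_rows])
        for name, col in candidates.items()
    }
    n_survivors = max(keep, len(coarse) // 2)
    survivors = sorted(coarse, key=coarse.get, reverse=True)[:n_survivors]
    # Stage 2 (fine): re-score only the survivors on all rows.
    fine = {name: incremental_gain(candidates[name], y, base_pred) for name in survivors}
    return sorted(fine.items(), key=lambda kv: kv[1], reverse=True)[:keep]


# Synthetic data: the target depends on the interaction x1 * x2.
random.seed(0)
n = 4000
x1 = [random.uniform(-1, 1) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
y = [a * b + random.gauss(0, 0.05) for a, b in zip(x1, x2)]
base_pred = [fmean(y)] * n  # stand-in for a real base model's predictions

candidates = {
    "x1": x1,
    "x2": x2,
    "x1+x2": [a + b for a, b in zip(x1, x2)],
    "x1*x2": [a * b for a, b in zip(x1, x2)],
}
ranked = two_stage_prune(candidates, y, base_pred)
print(ranked)
```

Scoring each candidate against fixed base predictions means the base model never has to be retrained per candidate, which is the efficiency argument the abstract makes for evaluating incremental performance. The coarse stage trades accuracy for speed, so only the survivors are scored on all rows.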