Before applying data analytics or machine learning to a dataset, a vital step is usually the construction of an informative set of features from the data. In this paper, we present SMARTFEAT, an efficient automated feature engineering tool that assists data users, including non-experts, in constructing useful features. Leveraging Foundation Models (FMs), our approach creates new features from the data based on contextual information and open-world knowledge. To this end, our method incorporates an intelligent operator selector that selects a subset of operators, avoiding the exhaustive combination of original features that is typical of traditional automated feature engineering tools. Moreover, we address a limitation of performing data tasks through row-level interactions with FMs, which can incur significant delays and costs due to excessive API calls. To tackle this, we introduce a function generator that obtains efficient data transformations, such as dataframe built-in methods or lambda functions, so that SMARTFEAT can generate new features for large datasets. With SMARTFEAT, dataset users can efficiently search for and apply transformations to obtain new features, improving the AUC of downstream ML classification by up to 29.8%.
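To illustrate the cost argument behind the function generator, the following minimal sketch contrasts row-level FM interaction with a single request that returns a reusable transformation. It is not SMARTFEAT's implementation: `StubFM`, its methods, the column names, and the body-mass-index feature are all hypothetical stand-ins, and the "model" is a local stub that only counts calls.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


class StubFM:
    """Hypothetical stand-in for a foundation model; counts calls, uses no real API."""

    def __init__(self) -> None:
        self.calls = 0

    def value_for_row(self, row: Row) -> float:
        # Row-level interaction: one (simulated) API call per row.
        self.calls += 1
        return row["weight_kg"] / row["height_m"] ** 2

    def generate_function(self, description: str) -> Callable[[Row], float]:
        # Function generation: one (simulated) call returns a reusable transformation.
        # The description is ignored here; a real model would interpret it.
        self.calls += 1
        return lambda row: row["weight_kg"] / row["height_m"] ** 2


def row_level_feature(fm: StubFM, rows: List[Row]) -> List[float]:
    """Ask the model for the feature value of every row separately."""
    return [fm.value_for_row(row) for row in rows]


def generated_function_feature(fm: StubFM, rows: List[Row]) -> List[float]:
    """Ask the model once for a function, then apply it locally to all rows."""
    transform = fm.generate_function("body mass index from weight_kg and height_m")
    return [transform(row) for row in rows]


# Assumed toy data; the column names are illustrative only.
rows: List[Row] = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 82.0, "height_m": 1.80},
    {"weight_kg": 55.0, "height_m": 1.60},
]

fm_row_level = StubFM()
fm_function = StubFM()
bmi_row_level = row_level_feature(fm_row_level, rows)
bmi_function = generated_function_feature(fm_function, rows)

# Call counts: one per row versus one in total.
print(fm_row_level.calls, fm_function.calls)
```

In this sketch the number of simulated model calls grows with the number of rows in the first approach and stays at one in the second, which is the property that makes a generated transformation attractive for large datasets.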