Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To obtain good features, data scientists need to write SQL queries manually, which is time-consuming. Featuretools [1] is a tool widely used by the data science community to automatically augment training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can generate these queries automatically. However, it does not include predicates in these queries, which significantly limits its applicability in many real-world scenarios. To overcome this limitation, we propose FeatAug, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates exponentially increases the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, no longer works and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how Bayesian optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show how the beam search idea can partially solve this problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features than Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug
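To make the distinction between a plain group-by aggregation feature and a predicate-aware one concrete, the following is a minimal sketch using Python's standard sqlite3 module. It is not FeatAug's or Featuretools' actual API: the `orders` table, its columns, and the helper `build_feature_query` are hypothetical names chosen purely for illustration.

```python
import sqlite3


def build_feature_query(agg_func, agg_col, group_key, table, predicate=None):
    """Build a group-by aggregation query, optionally with a WHERE predicate.

    Hypothetical helper for illustration only; inputs are trusted literals here.
    """
    where = f" WHERE {predicate}" if predicate else ""
    return (
        f"SELECT {group_key}, {agg_func}({agg_col}) AS feature "
        f"FROM {table}{where} GROUP BY {group_key}"
    )


# Hypothetical relevant table: each user (training row) has many orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "books", 10.0),
        (1, "toys", 30.0),
        (2, "books", 5.0),
        (2, "books", 15.0),
    ],
)

# Feature without a predicate: average order amount per user.
plain_query = build_feature_query("AVG", "amount", "user_id", "orders")

# Predicate-aware feature: average amount per user, restricted to book orders.
filtered_query = build_feature_query(
    "AVG", "amount", "user_id", "orders", predicate="category = 'books'"
)

plain_feature = dict(conn.execute(plain_query).fetchall())
filtered_feature = dict(conn.execute(filtered_query).fetchall())

print(plain_feature)
print(filtered_feature)
```

Even in this toy setting, every additional attribute and every value it can be compared against yields another candidate predicate, which is why the candidate space grows too quickly to materialize every query.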