Machine learning models depend critically on feature quality, yet useful features are often scattered across multiple relational tables. Feature augmentation enriches a base table by discovering and integrating features from related tables through join operations. However, scaling this process to complex schemas with many tables and multi-hop paths remains challenging. Feature augmentation must address three core tasks: identifying promising join paths that connect the base table to candidate tables, executing these joins to materialize augmented data, and selecting the most informative features from the results. Existing approaches face a fundamental tradeoff between effectiveness and efficiency: achieving high accuracy requires exploring many candidate paths, but exhaustive exploration is computationally prohibitive. Some methods compromise by considering only immediate neighbors, limiting their effectiveness, while others employ neural models that require expensive training data and introduce scalability limitations. We present Hippasus, a modular framework that achieves both effectiveness and efficiency through three key contributions. First, we combine lightweight statistical signals with semantic reasoning from Large Language Models to prune unpromising join paths before execution, focusing computational resources on high-quality candidates. Second, we employ optimized multi-way join algorithms and consolidate features from multiple paths, substantially reducing execution time. Third, we integrate LLM-based semantic understanding with statistical measures to select features that are both semantically meaningful and empirically predictive. Our experimental evaluation on publicly available datasets shows that Hippasus improves feature augmentation accuracy by up to 26.8% over state-of-the-art baselines while also achieving strong runtime performance.
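To make the pruning idea concrete, here is a minimal sketch of one kind of lightweight statistical signal the abstract alludes to: scoring a candidate join by how many of the base table's join-key values are contained in the candidate's key column, and discarding candidates below a threshold. The function names, the containment measure, and the threshold are illustrative assumptions, not the actual Hippasus design.

```python
# Hypothetical sketch of a statistical pruning signal for join paths.
# A candidate table whose join key covers few of the base table's key
# values would yield a mostly-empty join, so it can be pruned cheaply
# before any join is executed.

def key_containment(base_keys, candidate_keys):
    """Fraction of base-table join-key values present in the candidate."""
    base = set(base_keys)
    if not base:
        return 0.0
    return len(base & set(candidate_keys)) / len(base)

def prune_candidates(base_keys, candidates, threshold=0.5):
    """Keep candidate tables whose key containment meets the threshold.

    `candidates` maps a table name to the values in its join-key column;
    the 0.5 threshold is an arbitrary illustrative choice.
    """
    return [
        name for name, keys in candidates.items()
        if key_containment(base_keys, keys) >= threshold
    ]

# Example: "orders" shares most key values with the base table and is
# kept; "logs" shares none and is pruned.
base = [1, 2, 3, 4]
candidates = {"orders": [1, 2, 3, 9], "logs": [7, 8]}
print(prune_candidates(base, candidates))  # ['orders']
```

In a full system such a cheap filter would only narrow the search; the abstract's approach additionally applies LLM-based semantic reasoning to the surviving candidates before executing any joins.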