Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks. The code is available at: https://github.com/nikhilsab/LLMFE
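The program-search formulation described in the abstract can be illustrated with a toy loop. The sketch below is not the LLM-FE implementation: the LLM call is replaced by a hypothetical stand-in (`propose`) that draws untried programs from a fixed pool, and the data-driven feedback is simplified to the absolute Pearson correlation between a candidate feature and the target, where a real system would more likely use the validation score of a downstream model. All names (`CANDIDATE_POOL`, `propose`, `score`, `search`) and the synthetic data are assumptions made for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def make_data(n=200, seed=0):
    """Synthetic table with columns a, b; the target depends on a * b plus noise."""
    rng = random.Random(seed)
    rows = [{"a": rng.uniform(1, 10), "b": rng.uniform(1, 10)} for _ in range(n)]
    target = [r["a"] * r["b"] + rng.gauss(0, 1) for r in rows]
    return rows, target


# Hypothetical pool of feature transformation programs. In an LLM-driven
# system these would be generated as code by the model, not fixed in advance.
CANDIDATE_POOL = {
    "a_plus_b": lambda r: r["a"] + r["b"],
    "a_times_b": lambda r: r["a"] * r["b"],
    "a_over_b": lambda r: r["a"] / r["b"],
    "a_squared": lambda r: r["a"] ** 2,
}


def propose(memory, rng, k=2):
    """Stand-in for the LLM proposal step (assumption, not a real API).

    A real proposer would be prompted with the scored programs in `memory`;
    here we only use `memory` to avoid re-proposing programs already tried.
    """
    untried = [name for name in CANDIDATE_POOL if name not in memory]
    return rng.sample(untried, min(k, len(untried)))


def score(name, rows, target):
    """Simplified data-driven feedback: |correlation| of the feature with the target."""
    values = [CANDIDATE_POOL[name](r) for r in rows]
    return abs(pearson(values, target))


def search(rows, target, iterations=3, seed=0):
    """Iteratively propose programs, score them, and keep a memory of results."""
    rng = random.Random(seed)
    memory = {}  # program name -> score from earlier iterations
    for _ in range(iterations):
        for name in propose(memory, rng):
            memory[name] = score(name, rows, target)
    best = max(memory, key=memory.get)
    return best, memory


rows, target = make_data()
best, memory = search(rows, target)
print(best, round(memory[best], 3))
```

The `memory` dictionary plays the role of the record of prior feature discovery experiments that the abstract says existing approaches fail to leverage. In this sketch it only prevents repeated proposals, whereas an LLM-based proposer could condition on the scored programs to generate new ones.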