Code generation driven by large language models (LLMs) has become increasingly popular. However, automatically generating code for machine learning (ML) tasks still poses significant challenges. This paper explores the limits of program synthesis for ML by combining LLMs and automated machine learning (autoML). Specifically, our goal is to fully automate code generation for the entire ML workflow, from data preparation to modeling and post-processing, using only textual descriptions of the ML tasks. To manage the length and diversity of ML programs, we propose breaking each ML program into smaller, manageable parts. Each part is generated separately by the LLM, with careful attention to compatibility between the parts. To implement the approach, we design a testing technique for ML programs. Furthermore, our approach enables integration with autoML. In our approach, autoML numerically assesses and optimizes the ML programs generated by LLMs. LLMs, in turn, help bridge the gap between theoretical, algorithm-centered autoML and practical autoML applications. This mutual enhancement underscores the synergy between LLMs and autoML in program synthesis for ML. In experiments across various ML tasks, our method outperforms existing methods for generating ML programs in 10 out of 12 tasks. In addition, autoML significantly improves the performance of the generated ML programs. In the experiments, our method, Text-to-ML, achieves fully automated synthesis of the entire ML pipeline based solely on textual descriptions of the ML tasks.
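The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea it describes: a program split into separately produced parts, a test that the parts fit together, and a numerical search over the assembled program. Every name here (`prepare`, `make_model`, `postprocess`, `check_compatible`, `tune_threshold`) is hypothetical, the parts are hand-written stand-ins for LLM output, and the grid search over a single threshold is a deliberately simple substitute for autoML, not the method used in the paper.

```python
# Hypothetical stand-ins for the three parts an LLM would generate separately.
def prepare(rows):
    """Data preparation: min-max scale the single feature to [0, 1]."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0  # avoid division by zero on constant features
    return [((x - lo) / span, y) for x, y in rows]


def make_model(threshold):
    """Modeling: a one-parameter threshold classifier (illustrative only)."""
    def model(rows):
        return [1 if x >= threshold else 0 for x, _ in rows]
    return model


def postprocess(preds):
    """Post-processing: map numeric predictions to labels."""
    return ["pos" if p == 1 else "neg" for p in preds]


def check_compatible(stages, sample):
    """Assumed compatibility test: run a small sample through the chained
    parts and require that no part fails and one output per input remains."""
    data = sample
    try:
        for stage in stages:
            data = stage(data)
    except Exception:
        return False
    return len(data) == len(sample)


def accuracy(preds, rows):
    """Fraction of predictions that match the labels in rows."""
    return sum(p == y for p, (_, y) in zip(preds, rows)) / len(rows)


def tune_threshold(rows, candidates):
    """Stand-in for the autoML step: score each candidate hyperparameter
    numerically and keep the best one."""
    prepared = prepare(rows)
    scored = [(accuracy(make_model(t)(prepared), prepared), t) for t in candidates]
    best_score, best_t = max(scored)
    return best_t, best_score


if __name__ == "__main__":
    rows = [(0.0, 0), (2.0, 0), (6.0, 1), (10.0, 1)]
    best_t, best_score = tune_threshold(rows, [0.1, 0.5, 0.9])
    pipeline = [prepare, make_model(best_t), postprocess]
    print(check_compatible(pipeline, rows), best_t, best_score)
```

In this toy setup the compatibility check rejects parts chained in the wrong order, because a later part cannot consume the earlier part's output, and the search step picks the threshold with the highest score on the sample data. A real system would replace the hand-written parts with generated code and the grid search with an autoML tool.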