LLM-Meta-SR: In-Context Learning for Evolving Selection Operators in Symbolic Regression

Large language models (LLMs) have revolutionized algorithm development, yet their application in symbolic regression, where algorithms automatically discover symbolic expressions from data, remains limited. In this paper, we propose a meta-learning framework that enables LLMs to automatically design selection operators for evolutionary symbolic regression algorithms. We first identify two key limitations in existing LLM-based algorithm evolution techniques: lack of semantic guidance and code bloat. The absence of semantic awareness can lead to ineffective exchange of useful code components, while bloat results in unnecessarily complex components; both can hinder evolutionary learning progress or reduce the interpretability of the designed algorithm. To address these issues, we enhance the LLM-based evolution framework for meta-symbolic regression with two key innovations: a complementary, semantics-aware selection operator and bloat control. Additionally, we embed domain knowledge into the prompt, enabling the LLM to generate more effective and contextually relevant selection operators. Our experimental results on symbolic regression benchmarks show that LLMs can devise selection operators that outperform nine expert-designed baselines, achieving state-of-the-art performance. Moreover, the evolved operator can further improve a state-of-the-art symbolic regression algorithm, achieving the best performance among 28 symbolic regression and other machine learning algorithms across 116 regression datasets. This demonstrates that LLMs can exceed expert-level algorithm design for symbolic regression.
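To make the term "selection operator" concrete, the following is a minimal Python sketch of one possible semantics-aware, complementary parent selection. It is an illustration under stated assumptions, not the operator evolved in the paper: each individual is represented only by a hypothetical vector of per-sample absolute errors, and the function name `complementary_selection`, the `tournament_size` parameter, and the element-wise-minimum complementarity score are all assumptions introduced here.

```python
import random
from typing import List, Sequence, Tuple

# Assumption: an individual is summarized by its per-sample absolute errors
# on the training set (used here as a stand-in for its "semantics").
ErrorVector = Sequence[float]


def mean_error(errors: ErrorVector) -> float:
    """Average error over all training samples."""
    return sum(errors) / len(errors)


def complementary_selection(
    population: List[ErrorVector],
    rng: random.Random,
    tournament_size: int = 3,
) -> Tuple[int, int]:
    """Return the indices of two parents (population must hold at least two).

    The first parent wins a small tournament on mean error. The second parent
    is the individual that best complements the first, i.e. the one that
    minimizes the mean of the element-wise minimum of the two error vectors.
    """
    k = min(tournament_size, len(population))
    candidates = rng.sample(range(len(population)), k=k)
    first = min(candidates, key=lambda i: mean_error(population[i]))

    def combined_error(j: int) -> float:
        # Error the pair would have if, on each sample, the better parent were used.
        return mean_error([min(a, b) for a, b in zip(population[first], population[j])])

    others = [j for j in range(len(population)) if j != first]
    second = min(others, key=combined_error)
    return first, second


if __name__ == "__main__":
    # Hypothetical error vectors for three individuals on three samples.
    population = [
        [0.0, 1.0, 1.0],  # good only on sample 0
        [1.0, 0.0, 0.0],  # good on samples 1 and 2
        [0.9, 0.9, 0.1],  # mediocre overall
    ]
    print(complementary_selection(population, random.Random(0), tournament_size=3))
```

In this toy population, individual 0 has the worst mean error yet is chosen as the second parent, because it is accurate exactly where the first parent fails. A selection rule based on aggregate fitness alone would tend to overlook it.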
