Symbolic Regression (SR) is the task of recovering the mathematical expression underlying a set of empirical observations. Transformer-based methods trained on SR datasets hold the current state of the art in this task, while the application of Large Language Models (LLMs) to SR remains unexplored. This work investigates the integration of pre-trained LLMs into the SR pipeline, using an approach that iteratively refines a functional form based on the prediction error it achieves on the observation set until convergence. Our method uses LLMs to propose an initial set of candidate functions from the observations, exploiting their strong pre-training prior. These functions are then iteratively refined by the model itself, while an external optimizer fits their coefficients. The process is repeated until the results are satisfactory. We then analyze Vision-Language Models in this context, exploring the inclusion of plots as visual inputs to aid the optimization process. Our findings show that LLMs can recover symbolic equations that fit the given data well, outperforming SR baselines based on Genetic Programming, and that adding images to the input shows promising results on the most complex benchmarks.
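The propose, fit, and refine loop described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: `propose_candidates` is a hypothetical stand-in for the LLM call and simply returns untried entries from a fixed pool, each candidate is restricted to a linear combination of basis terms so that NumPy's least-squares solver can play the role of the external coefficient optimizer, and the names `fit_coefficients`, `symbolic_regression`, `max_iters`, and `tol` are illustrative only.

```python
import numpy as np

# ASSUMPTION: a fixed pool of candidate skeletons replaces the LLM.
# Each skeleton is a list of (term name, basis function) pairs.
POOL = [
    [("1", np.ones_like), ("x", lambda x: x)],
    [("1", np.ones_like), ("x", lambda x: x), ("x**2", lambda x: x**2)],
    [("sin(x)", np.sin), ("x**2", lambda x: x**2)],
]


def propose_candidates(history, batch_size=2):
    """Hypothetical stand-in for the LLM: return skeletons not yet tried.

    A real system would prompt a model with the observations and the
    scored history; here the history is only used to skip repeats.
    """
    tried = {names for names, _, _ in history}
    untried = [c for c in POOL if tuple(n for n, _ in c) not in tried]
    return untried[:batch_size]


def fit_coefficients(candidate, x, y):
    """External optimizer: least-squares fit of one coefficient per term."""
    design = np.column_stack([f(x) for _, f in candidate])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    mse = float(np.mean((design @ coef - y) ** 2))
    return coef, mse


def symbolic_regression(x, y, max_iters=5, tol=1e-10):
    """Iterate propose -> fit -> score until the error is below tol."""
    history = []  # entries: (term names, coefficients, mse)
    best = None
    for _ in range(max_iters):
        candidates = propose_candidates(history)
        if not candidates:
            break  # nothing left to try
        for cand in candidates:
            names = tuple(n for n, _ in cand)
            coef, mse = fit_coefficients(cand, x, y)
            history.append((names, coef, mse))
            if best is None or mse < best[2]:
                best = (names, coef, mse)
        if best[2] < tol:
            break  # satisfactory fit reached
    return best


# Synthetic observations generated from 2*sin(x) + 0.5*x**2.
x = np.linspace(-3.0, 3.0, 50)
y = 2.0 * np.sin(x) + 0.5 * x**2
names, coef, mse = symbolic_regression(x, y)
print(names, np.round(coef, 3), mse)
```

In this sketch the feedback to the proposer is reduced to a list of scored attempts. In the setting the abstract describes, that history (and optionally a plot of the fit) would presumably be part of the model's input for the next round, and the coefficient fit would generally need a nonlinear optimizer rather than linear least squares.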