Symbolic regression (SR) is a challenging machine learning task: finding a mathematical expression for a function from its observed values. Recent work in SR has shown that pretrained transformer-based models are effective at generating equations as sequences; they benefit from large-scale pretraining on synthetic datasets and are considerably faster at inference than genetic programming (GP) based methods. However, these models rely on supervised pretraining objectives borrowed from text generation and ignore equation-specific objectives such as accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR allows non-differentiable feedback, such as fitting accuracy and complexity, to be integrated as external sources of knowledge into the equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, improving the model's fitting-complexity trade-off, extrapolation ability, and robustness to noise. We also demonstrate that caching mechanisms can further improve the efficiency of TPSR.
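To make the idea of non-differentiable feedback concrete, the following is a minimal sketch of a scalar reward that combines fitting accuracy with a complexity penalty, the kind of score a tree search could use to rank candidate equations. The abstract does not give the reward's form, so everything here is an assumption for illustration: the functional form, the names `nmse` and `reward`, and the parameters `lam` and `max_tokens`.

```python
import math

def nmse(y_true, y_pred):
    """Mean squared error divided by the mean of y_true squared."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    scale = sum(t ** 2 for t in y_true) / n
    return mse / (scale + 1e-12)  # small constant avoids division by zero

def reward(y_true, y_pred, n_tokens, lam=0.1, max_tokens=200):
    """Hypothetical reward: a fit term in (0, 1] plus a weighted simplicity term.

    n_tokens is the length of the candidate equation's token sequence;
    lam and max_tokens are illustrative hyperparameters, not values from the paper.
    """
    fit = 1.0 / (1.0 + nmse(y_true, y_pred))      # 1.0 for a perfect fit
    simplicity = math.exp(-n_tokens / max_tokens)  # shorter sequences score higher
    return fit + lam * simplicity

# Usage: score two candidates for data generated by y = x^2 + x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [x * x + x for x in xs]
exact_short = reward(ys, [x * x + x for x in xs], n_tokens=5)
linear_only = reward(ys, [x for x in xs], n_tokens=3)
print(exact_short, linear_only)
```

Because the score depends only on evaluating a finished expression against the data, it needs no gradients, which is why a search procedure rather than a training loss is the natural place to use it.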