Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the efficacy of pretrained transformer-based models that generate equations as sequences. These models benefit from large-scale pretraining on synthetic datasets and offer considerable advantages over genetic programming (GP) based methods in terms of inference time. However, they focus on supervised pretraining objectives borrowed from text generation and ignore equation-specific objectives such as accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR allows non-differentiable feedback, such as fitting accuracy and complexity, to be integrated into the equation generation process as external sources of knowledge. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, improving the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise. We also demonstrate that caching mechanisms can further improve the efficiency of TPSR.
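To make the idea of non-differentiable feedback concrete, the following is a minimal sketch of a reward that a tree search could use to score a candidate equation. It is an illustration only: the abstract does not give a formula, so the functional form, the names `nmse` and `equation_reward`, and the parameters `lam` and `max_tokens` are all assumptions introduced here.

```python
import math


def nmse(y_true, y_pred, eps=1e-12):
    """Mean squared error normalized by the variance of the targets."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var = sum((t - mean_y) ** 2 for t in y_true) / n
    return mse / (var + eps)


def equation_reward(y_true, y_pred, n_tokens, max_tokens=200, lam=0.1):
    """Hypothetical reward: higher for better fit and for shorter equations.

    Assumed form (not taken from the text):
        1 / (1 + NMSE) + lam * exp(-n_tokens / max_tokens)
    Neither term needs gradients, so it can only be used as search
    feedback, not as a training loss.
    """
    fit_term = 1.0 / (1.0 + nmse(y_true, y_pred))
    complexity_term = lam * math.exp(-n_tokens / max_tokens)
    return fit_term + complexity_term


# Toy data sampled from y = x^2 + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + 1.0 for x in xs]

exact_short = equation_reward(ys, ys, n_tokens=5)                   # exact fit, short
exact_long = equation_reward(ys, ys, n_tokens=50)                   # exact fit, long
poor_short = equation_reward(ys, [2.0 * x for x in xs], n_tokens=5)  # poor fit
```

In a planning-based decoder, a score of this kind would be computed for each completed candidate sequence and propagated back through the search tree, so that decoding favors equations that fit the data well and stay compact. The weight `lam` controls how strongly length is penalized relative to fit.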