Discovering Mathematical Formulas from Data via GPT-guided Monte Carlo Tree Search

Finding a concise and interpretable mathematical formula that accurately describes the relationship between the input variables and the predicted value in a dataset is a crucial task in scientific research and a significant challenge in artificial intelligence. This problem, known as symbolic regression, is NP-hard. Last year, a symbolic regression method based on Monte Carlo Tree Search (MCTS) was proposed and achieved state-of-the-art results on a diverse range of datasets. Although this algorithm recovers target expressions considerably better than previous methods, the lack of guidance during the MCTS process severely hampers its search efficiency. More recently, some algorithms have added a pre-trained policy network to guide the MCTS, but such a network generalizes poorly. To balance efficiency and versatility, we introduce SR-GPT, a novel symbolic regression algorithm that integrates MCTS with a Generative Pre-Trained Transformer (GPT). Using the GPT to guide the MCTS significantly improves search efficiency. In turn, the MCTS results are used to further refine the GPT, enhancing its capabilities so that it provides more accurate guidance for the MCTS. The MCTS and the GPT are thus coupled and optimize each other until the target expression is found. We conducted extensive evaluations of SR-GPT on 222 expressions drawn from more than 10 symbolic regression datasets. The experimental results show that SR-GPT outperforms existing state-of-the-art algorithms in accurately recovering symbolic expressions, both with and without added noise.
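The abstract does not say how the GPT's guidance enters the tree search or how the MCTS results are turned into a training signal. As a rough illustration only, the sketch below assumes an AlphaZero-style PUCT selection rule and normalized visit counts as the policy target; both are assumptions rather than details confirmed by the text above. The token set, the reward `1 / (1 + RMSE)`, and all function names are hypothetical.

```python
import math

# Hypothetical token set for prefix-notation expressions: token -> arity.
ARITY = {"add": 2, "mul": 2, "sin": 1, "x": 0, "1": 0}

def is_complete(tokens):
    """True if tokens form exactly one well-formed prefix expression."""
    open_slots = 1  # one operand slot is open for the root
    for i, tok in enumerate(tokens):
        open_slots += ARITY[tok] - 1
        if open_slots == 0:
            # The expression closed here; it is valid only if nothing follows.
            return i == len(tokens) - 1
    return False

def evaluate(tokens, x):
    """Evaluate a complete prefix expression at the scalar input x."""
    def walk(i):
        tok = tokens[i]
        if tok == "x":
            return x, i + 1
        if tok == "1":
            return 1.0, i + 1
        a, i = walk(i + 1)
        if tok == "sin":
            return math.sin(a), i
        b, i = walk(i)
        return (a + b if tok == "add" else a * b), i
    value, _ = walk(0)
    return value

def reward(tokens, xs, ys):
    """Assumed reward: 1 / (1 + RMSE), which lies in (0, 1]."""
    mse = sum((evaluate(tokens, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return 1.0 / (1.0 + math.sqrt(mse))

def puct_select(priors, visits, value_sums, c_puct=1.0):
    """Pick the child maximizing Q + c * P * sqrt(N + 1) / (1 + n).

    priors: token -> probability from the policy model (assumed given).
    visits / value_sums: per-child search statistics.
    """
    total = sum(visits.values())
    def score(tok):
        q = value_sums[tok] / visits[tok] if visits[tok] else 0.0
        u = c_puct * priors[tok] * math.sqrt(total + 1) / (1 + visits[tok])
        return q + u
    return max(priors, key=score)

def policy_target(visits):
    """Normalize visit counts into a distribution used to refine the policy."""
    total = sum(visits.values())
    return {tok: n / total for tok, n in visits.items()}

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0]
    ys = [x + 1.0 for x in xs]
    print(reward(["add", "x", "1"], xs, ys))  # exact fit gives reward 1.0
    priors = {"x": 0.7, "sin": 0.2, "add": 0.1}
    zeros = {tok: 0 for tok in priors}
    print(puct_select(priors, zeros, zeros))  # unvisited: highest prior wins
```

In this sketch the prior dominates while a node is unvisited, so a well-trained policy concentrates simulations on promising tokens, while the `1 / (1 + n)` factor shifts attention to less-explored children as visits accumulate. The visit distribution returned by `policy_target` is one plausible way for the search to feed back into the policy model.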
