Symbolic regression has recently gained traction in AI-driven scientific discovery, aiming to recover explicit closed-form expressions from data that reveal underlying physical laws. Despite recent advances, existing methods remain dominated by heuristic search algorithms or data-intensive approaches that assume low-noise regimes and lack principled uncertainty quantification. Fully probabilistic formulations are scarce, and existing Markov chain Monte Carlo-based Bayesian methods often struggle to efficiently explore the highly multimodal combinatorial space of symbolic expressions. We introduce VaSST, a scalable probabilistic framework for symbolic regression based on variational inference. VaSST employs a continuous relaxation of symbolic expression trees, termed soft symbolic trees, where discrete operator and feature assignments are replaced by soft distributions over allowable components. This relaxation transforms the combinatorial search over an astronomically large symbolic space into an efficient gradient-based optimization problem while preserving a coherent probabilistic interpretation. The learned soft representations induce posterior distributions over symbolic structures, enabling principled uncertainty quantification. Across simulated experiments and the Feynman Symbolic Regression Database within SRBench, VaSST achieves superior performance in both structural recovery and predictive accuracy compared to state-of-the-art symbolic regression methods.
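To make the idea of a soft symbolic tree concrete, the following is a minimal sketch, not the VaSST implementation. It assumes a depth-1 tree with one binary operator node over two leaves, a hypothetical operator set (`add`, `mul`, `sub`), and softmax-parameterized weights; the names `soft_leaf`, `soft_node`, `soft_tree`, and `harden` are illustrative only. Each discrete choice is replaced by a convex combination over the allowable components, so the tree output is a smooth function of the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical operator set, chosen for illustration only.
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}
OP_NAMES = list(BINARY_OPS)

def soft_leaf(feature_logits, x):
    """Soft feature assignment: convex combination of the input features."""
    weights = softmax(feature_logits)
    return sum(w * xi for w, xi in zip(weights, x))

def soft_node(op_logits, left, right):
    """Soft operator assignment: convex combination of each operator's output."""
    weights = softmax(op_logits)
    return sum(w * BINARY_OPS[name](left, right)
               for w, name in zip(weights, OP_NAMES))

def soft_tree(params, x):
    """Depth-1 soft tree: one soft operator node over two soft leaves."""
    left = soft_leaf(params["left"], x)
    right = soft_leaf(params["right"], x)
    return soft_node(params["op"], left, right)

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def harden(params):
    """Read off the most probable discrete expression at each node."""
    op = OP_NAMES[argmax(params["op"])]
    return f"{op}(x{argmax(params['left'])}, x{argmax(params['right'])})"

# Logits sharply peaked on mul(x0, x1); assumed values for the demo.
params = {"op": [0.0, 8.0, 0.0], "left": [8.0, 0.0], "right": [0.0, 8.0]}
x = [2.0, 3.0]
y_soft = soft_tree(params, x)   # close to x0 * x1 when the logits are peaked
expr = harden(params)           # discrete structure recovered by argmax
```

Because `soft_tree` is differentiable in every logit, a gradient-based optimizer could in principle fit the logits to data, and the softmax weights at each node can be read as a distribution over structures. The sketch omits the variational objective, priors, and deeper trees, none of which are specified in the abstract.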