Symbolic Regression (SR) tries to reveal the hidden equations behind observed data. However, most methods search within a discrete equation space, where the structural modifications of equations rarely align with their numerical behavior, leaving fitting error feedback too noisy to guide exploration. To address this challenge, we propose GenSR, a generative latent space-based SR framework following the "map construction -> coarse localization -> fine search" paradigm. Specifically, GenSR first pretrains a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. This space can be regarded as a well-structured "map" of the equation space, providing directional signals for search. At inference, the CVAE coarsely localizes the input data to promising regions in the latent space. Then, a modified CMA-ES refines the candidate region, leveraging smooth latent gradients. From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution $p(\mathrm{Equ.} \mid \mathrm{Num.})$, with CVAE training achieving this objective through the Evidence Lower Bound (ELBO). This new perspective provides a theoretical guarantee for the effectiveness of GenSR. Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency, while remaining robust under noise.
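The "coarse localization -> fine search" step can be illustrated with a toy sketch. This is not the GenSR implementation: the abstract does not specify how CMA-ES is modified, so the code below swaps in a much simpler evolution strategy (isotropic Gaussian sampling, elite averaging, a fixed step-size decay, no covariance adaptation). The names `fitting_error`, `refine`, and `coarse_guess` are hypothetical, and the quadratic objective merely stands in for "decode a latent point into an equation and measure its fitting error", under the assumption that this error varies smoothly over the latent space.

```python
import numpy as np


def fitting_error(z):
    # Hypothetical stand-in for: decode z into an equation, evaluate it on the
    # data, and return the fit error. A smooth bowl mimics a smooth latent space.
    target = np.array([1.5, -0.5])
    return float(np.sum((z - target) ** 2))


def refine(z0, sigma=0.5, pop_size=16, n_elite=4, n_iters=60, seed=0):
    """Simplified evolution strategy (not full CMA-ES) started from z0."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(z0, dtype=float)
    best_z, best_err = mean.copy(), fitting_error(mean)
    for _ in range(n_iters):
        # Sample a population of latent candidates around the current mean.
        candidates = mean + sigma * rng.standard_normal((pop_size, mean.size))
        errors = np.array([fitting_error(c) for c in candidates])
        # Move the mean to the average of the lowest-error candidates.
        elite = candidates[np.argsort(errors)[:n_elite]]
        mean = elite.mean(axis=0)
        # Shrink the search radius on a fixed schedule.
        sigma *= 0.95
        # Track the best candidate seen so far.
        i = int(np.argmin(errors))
        if errors[i] < best_err:
            best_z, best_err = candidates[i].copy(), float(errors[i])
    return best_z, best_err


# Hypothetical coarse localization: in GenSR this point would come from the
# pretrained CVAE conditioned on the input data; here it is hard-coded.
coarse_guess = np.array([1.0, 0.0])
z_star, err = refine(coarse_guess)
print("refined latent point:", z_star, "error:", err)
```

Because the search starts from the coarse guess and only keeps improvements, the returned error is never worse than the error at the starting point; how much it improves depends on how smooth the objective actually is, which is the property the pretrained latent space is meant to provide.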