Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this with general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise rather than more complex expressions with increasing inference budget.
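As a rough illustration of rule-based reduction to a normalized form and of the decontamination step it enables, consider the minimal sketch below. It is not SimpliPy's actual API or rule set: the tuple-based expression encoding, the handful of identity rules, and the names `simplify` and `decontaminate` are all assumptions made for this example.

```python
# Toy rule-based simplifier on expression trees encoded as nested tuples,
# e.g. ("+", "x", "0"). Hypothetical illustration; not SimpliPy's API.

COMMUTATIVE = {"+", "*"}


def simplify(expr):
    """Apply a few identity rules in one bottom-up pass."""
    if not isinstance(expr, tuple):
        return expr  # leaf: variable name or constant token
    op, *args = expr
    args = [simplify(a) for a in args]
    # Order operands of commutative operators so that x*y and y*x
    # end up with the same normalized form.
    if op in COMMUTATIVE:
        args.sort(key=repr)
    if len(args) == 2:
        a, b = args
        if op == "+" and "0" in (a, b):   # x + 0 -> x
            return b if a == "0" else a
        if op == "*" and "0" in (a, b):   # x * 0 -> 0
            return "0"
        if op == "*" and "1" in (a, b):   # x * 1 -> x
            return b if a == "1" else a
        if op == "-" and b == "0":        # x - 0 -> x
            return a
        if op == "-" and a == b:          # x - x -> 0
            return "0"
        if op == "/" and b == "1":        # x / 1 -> x
            return a
    return (op, *args)


def decontaminate(train, test):
    """Drop training expressions whose normalized form occurs in the test set."""
    test_keys = {simplify(e) for e in test}
    return [e for e in train if simplify(e) not in test_keys]


train = [("+", "x", "0"), ("*", "y", "x"), ("sin", "x")]
test = ["x", ("*", "x", "y")]
print(decontaminate(train, test))
```

Because every rule here returns either a constant or an already simplified subterm, a single bottom-up pass is enough for this toy rule set. A realistic engine would need many more rules and a strategy for applying them repeatedly.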