Symbolic regression, one of the most crucial tasks in AI for science, discovers governing equations from experimental data. Popular approaches based on genetic programming, Monte Carlo tree search, or deep reinforcement learning learn symbolic regression from a fixed dataset. They require massive datasets and long training times, especially when learning complex equations involving many variables. Recently, Control Variable Genetic Programming (CVGP) was introduced, which accelerates the regression process by discovering equations from designed control variable experiments. However, the set of experiments is fixed a priori in CVGP, and we observe that a sub-optimal selection of experiment schedules delays the discovery process significantly. To overcome this limitation, we propose Racing Control Variable Genetic Programming (Racing-CVGP), which carries out multiple experiment schedules simultaneously. A selection scheme similar to the one used to select good symbolic equations in the genetic programming process ensures that promising experiment schedules eventually win over the average ones. Unfavorable schedules are terminated early to save time for the promising ones. We evaluate Racing-CVGP on several synthetic and real-world datasets corresponding to true physics laws. We demonstrate that Racing-CVGP outperforms CVGP and a series of symbolic regressors that discover equations from fixed datasets.
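The racing idea described in the abstract, running several experiment schedules side by side and terminating the unfavorable ones early, can be illustrated with a minimal sketch. It is not the Racing-CVGP implementation: the function name `race_schedules`, the `score` callback, and the `keep_fraction` parameter are hypothetical, and plain truncation selection stands in for the paper's selection scheme, which the abstract only describes as similar to the one used in genetic programming.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a schedule is represented as the order in which controlled
# variables are released, e.g. (0, 2, 1).
Schedule = Tuple[int, ...]


def race_schedules(
    schedules: Sequence[Schedule],
    score: Callable[[Schedule, int], float],
    rounds: int,
    keep_fraction: float = 0.5,
) -> List[Schedule]:
    """Race schedules against each other, dropping the worst after each round.

    `score(schedule, round_idx)` is a hypothetical callback that returns the
    fitness (higher is better) reached under that schedule in the given round.
    In a real system it would advance a genetic-programming search.
    """
    survivors = list(schedules)
    for round_idx in range(rounds):
        if len(survivors) <= 1:
            break
        # Rank the surviving schedules from best to worst for this round.
        ranked = sorted(survivors, key=lambda s: score(s, round_idx), reverse=True)
        # Truncation selection: keep the top fraction, always at least one.
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]  # the remaining schedules are terminated early
    return survivors
```

With a fixed lookup table of scores as the callback, each round halves the pool, so the compute that would have gone to weak schedules is left for the promising ones.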