Symbolic regression, the task of extracting mathematical expressions from observed data $\{ \vx_i, y_i \}$, plays a crucial role in scientific discovery. Despite their promising performance, most existing methods conduct symbolic regression in an \textit{offline} setting: they treat the observed data points as given, typically sampled from uniform distributions, without exploring the expressive potential of the data. In real-world scientific problems, however, the data used for symbolic regression are usually obtained actively by running experiments, which is an \textit{online} setting. How to obtain informative data that facilitate the symbolic regression process is therefore an important problem that remains challenging. In this paper, we propose QUOSR, a \textbf{qu}ery-based framework for \textbf{o}nline \textbf{s}ymbolic \textbf{r}egression that automatically obtains informative data in an iterative manner. Specifically, at each step, QUOSR receives the historical data points, generates a new $\vx$, and then queries the symbolic expression to get the corresponding $y$; the pair $(\vx, y)$ serves as a new data point. This process repeats until the maximum number of query steps is reached. To make the generated data points informative, we implement the framework with a neural network and train it by maximizing the mutual information between the generated data points and the target expression. Through comprehensive experiments, we show that QUOSR can facilitate modern symbolic regression methods by generating informative data.
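The iterative query procedure described in the abstract can be illustrated with a minimal sketch. This is not the QUOSR implementation: the names `target_expression`, `placeholder_policy`, and `run_queries` are hypothetical, the input is a scalar rather than a vector $\vx$, and the policy simply draws uniform random inputs where QUOSR would use a neural network trained to maximize mutual information. Only the loop structure (receive history, generate a new input, query the expression, append the pair, stop at the maximum number of steps) follows the text.

```python
import random
from typing import Callable, List, Tuple

# A data point is an (x, y) pair; x is a scalar here for simplicity (assumption).
DataPoint = Tuple[float, float]


def target_expression(x: float) -> float:
    """Hypothetical hidden expression; stands in for running a real experiment."""
    return x ** 2 + 1.0


def placeholder_policy(history: List[DataPoint], rng: random.Random) -> float:
    """Hypothetical query policy.

    It receives the historical data points but ignores them and samples
    uniformly from [-1, 1]. A learned policy would condition on `history`.
    """
    return rng.uniform(-1.0, 1.0)


def run_queries(
    oracle: Callable[[float], float],
    policy: Callable[[List[DataPoint], random.Random], float],
    max_steps: int,
    seed: int = 0,
) -> List[DataPoint]:
    """Collect data by repeatedly generating an input and querying the oracle."""
    rng = random.Random(seed)
    history: List[DataPoint] = []
    for _ in range(max_steps):
        x = policy(history, rng)   # generate a new input from the history
        y = oracle(x)              # query the expression for the matching output
        history.append((x, y))     # the pair becomes a new data point
    return history


points = run_queries(target_expression, placeholder_policy, max_steps=5)
```

The resulting `points` list would then be handed to a symbolic regression method of choice. Swapping `placeholder_policy` for a learned model is where the contribution described above would enter.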