Symbolic regression aims to find symbolic expressions that describe datasets. Because the resulting models are interpretable, it is a machine learning paradigm particularly well suited to scientific discovery. In recent years, several works have expanded the concept to allow the description of similar phenomena using a single expression with varying sets of parameters, thereby introducing categorical variables. Some previous works allow only "non-shared" (category-value-specific) parameters, while others also incorporate "shared" (category-value-agnostic) parameters. We expand upon those efforts by considering multiple categorical variables and introducing intermediate levels of parameter sharing. With two categorical variables, for example, the intermediate level consists of parameters that are shared across the values of one categorical variable but vary across the values of the other. The new approach can decrease the number of parameters while revealing additional information about the problem. Using a synthetic, fitting-only example, we test the limits of this setup in terms of data requirement reduction and transfer learning. As a real-world symbolic regression example, we demonstrate the benefits of the proposed approach on an astrophysics dataset used in a previous study, which considered only one categorical variable. We achieve a similar fit quality but require significantly fewer individual parameters, and we extract additional information about the problem.
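The sharing levels described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the expression `y = c0*x**2 + c1[a]*x + c2[b]`, the level names (`shared`, `per_a`, `per_b`, `non_shared`), and the helper functions are hypothetical. Here `a` and `b` are integer codes for two categorical variables with `n_a` and `n_b` values respectively.

```python
import numpy as np

# Hypothetical sharing levels for two categorical variables A and B.
# Each level maps to the number of scalar values one constant expands into.
def level_sizes(n_a, n_b):
    return {
        "shared": 1,              # one value for all categories
        "per_a": n_a,             # varies with A, shared across B
        "per_b": n_b,             # varies with B, shared across A
        "non_shared": n_a * n_b,  # one value per (A, B) combination
    }

def count_parameters(n_a, n_b, levels):
    """Total number of scalar parameters for the given per-constant levels."""
    sizes = level_sizes(n_a, n_b)
    return sum(sizes[level] for level in levels)

def evaluate(x, a, b, c0, c1, c2):
    """Evaluate the illustrative expression y = c0*x**2 + c1[a]*x + c2[b].

    c0 is shared, c1 holds one value per category of A,
    and c2 holds one value per category of B.
    """
    return c0 * x**2 + c1[a] * x + c2[b]

# Example: 3 values of A, 4 values of B, three constants in the expression.
n_a, n_b = 3, 4
mixed = count_parameters(n_a, n_b, ["shared", "per_a", "per_b"])
fully_split = count_parameters(n_a, n_b, ["non_shared"] * 3)

x = np.array([1.0, 2.0])
a = np.array([0, 2])  # category codes of A for each sample
b = np.array([1, 3])  # category codes of B for each sample
y = evaluate(x, a, b, 2.0,
             np.array([1.0, 10.0, 100.0]),
             np.array([0.0, 0.5, 0.0, -1.0]))
```

In this toy setting the mixed assignment needs 1 + 3 + 4 = 8 values, whereas making all three constants non-shared needs 3 × 12 = 36, which is the kind of reduction the intermediate levels are meant to enable.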