When choosing between competing symbolic models for a dataset, a human will naturally prefer the "simpler" expression or the one which more closely resembles equations previously seen in a similar context. This suggests a non-uniform prior on functions, which is, however, rarely considered within a symbolic regression (SR) framework. In this paper we develop methods to incorporate detailed prior information on both functions and their parameters into SR. Our prior on the structure of a function is based on an $n$-gram language model, which is sensitive to the arrangement of operators relative to one another in addition to the frequency of occurrence of each operator. We also develop a formalism based on the Fractional Bayes Factor to treat numerical parameter priors in such a way that models may be fairly compared through the Bayesian evidence, and we explicitly compare Bayesian, Minimum Description Length and heuristic methods for model selection. We demonstrate the performance of our priors relative to literature standards on benchmarks and a real-world dataset from the field of cosmology.
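To illustrate how an $n$-gram prior can depend on operator arrangement as well as operator frequency, the following is a minimal sketch for the bigram case ($n=2$). It is not the implementation used in this work: the flattened prefix-notation encoding, the toy `CORPUS`, the `VOCAB` list, the add-`alpha` smoothing and the function name `train_bigram` are all assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical operator vocabulary: "x" is the variable, "c" a free parameter.
VOCAB = ["+", "*", "pow", "exp", "sin", "x", "c"]

# Hypothetical training corpus of expressions in prefix notation,
# e.g. ["*", "c", "exp", "x"] stands for c * exp(x).
CORPUS = [
    ["*", "c", "pow", "x", "c"],
    ["+", "c", "*", "c", "x"],
    ["*", "c", "exp", "x"],
    ["+", "*", "c", "x", "c"],
]


def train_bigram(corpus, vocab, alpha=1.0):
    """Return a function giving the smoothed bigram log-probability of an expression."""
    pair_counts = Counter()     # counts of (previous token, current token)
    context_counts = Counter()  # counts of each token appearing as a context
    for expr in corpus:
        tokens = ["<s>"] + list(expr)  # "<s>" marks the start of an expression
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    size = len(vocab)

    def log_prob(expr):
        tokens = ["<s>"] + list(expr)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-alpha smoothing keeps unseen operator pairs at non-zero probability.
            num = pair_counts[(prev, cur)] + alpha
            den = context_counts[prev] + alpha * size
            total += math.log(num / den)
        return total

    return log_prob


log_prior = train_bigram(CORPUS, VOCAB)

familiar = ["*", "c", "exp", "x"]    # c * exp(x), an arrangement seen in the corpus
nested = ["exp", "exp", "exp", "x"]  # exp(exp(exp(x))), never seen in the corpus

print(log_prior(familiar), log_prior(nested))
```

Both test expressions contain four tokens, so the difference in log-prior comes from which operators follow which, not from length. This sketch omits an end-of-expression token and ignores tree structure, so it is not a normalised distribution over expressions of varying size; conditioning on parent or sibling nodes in the expression tree would be a natural refinement.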