Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

From arXiv. 24 pages, 5 figures, 3 tables. Feedback welcome. Paper source: https://github.com/MilesCranmer/pysr_paper ; PySR: https://github.com/MilesCranmer/PySR ; SymbolicRegression.jl: https://github.com/MilesCranmer/SymbolicRegression.jl

PySR is an open-source library for practical symbolic regression, a type of machine learning which aims to discover human-interpretable symbolic models. PySR was developed to democratize and popularize symbolic regression for the sciences, and is built on a high-performance distributed back-end, a flexible search algorithm, and interfaces with several deep learning packages. PySR's internal search algorithm is a multi-population evolutionary algorithm, which consists of a unique evolve-simplify-optimize loop, designed for optimization of unknown scalar constants in newly-discovered empirical expressions. PySR's backend is the extremely optimized Julia library SymbolicRegression.jl, which can be used directly from Julia. It is capable of fusing user-defined operators into SIMD kernels at runtime, performing automatic differentiation, and distributing populations of expressions to thousands of cores across a cluster. In describing this software, we also introduce a new benchmark, "EmpiricalBench," to quantify the applicability of symbolic regression algorithms in science. This benchmark measures recovery of historical empirical equations from original and synthetic datasets.
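To make the evolve-simplify-optimize idea concrete, here is a toy sketch in plain Python. It is not PySR's or SymbolicRegression.jl's implementation, and every name in it (`mutate`, `simplify`, `optimize`, `search`) is hypothetical. The tuple encoding of expressions, the two-operator set, the tournament size, the random-replacement rule, the size cap, and the derivative-free coordinate search for constants are all assumptions chosen to keep the example short. It only illustrates the shape of the loop: several populations evolve separately, new candidates are simplified by constant folding, the scalar constants of each population's best member are tuned, and the overall best expression migrates between populations.

```python
import math
import random

# Expressions are nested tuples: ("x",), ("c", value), or (op, left, right).
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(expr, x):
    if expr[0] == "x":
        return x
    if expr[0] == "c":
        return expr[1]
    return OPS[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def loss(expr, data):
    """Mean squared error; non-finite results count as infinitely bad."""
    total = 0.0
    for x, y in data:
        err = evaluate(expr, x) - y
        total += err * err
    mse = total / len(data)
    return mse if math.isfinite(mse) else math.inf

def size(expr):
    return 1 if expr[0] not in OPS else 1 + size(expr[1]) + size(expr[2])

def random_leaf(rng):
    return ("x",) if rng.random() < 0.5 else ("c", rng.uniform(-2.0, 2.0))

def mutate(expr, rng):
    """Evolve step: descend into a child, wrap in a new operator, or replace."""
    r = rng.random()
    if expr[0] in OPS and r < 0.6:
        if rng.random() < 0.5:
            return (expr[0], mutate(expr[1], rng), expr[2])
        return (expr[0], expr[1], mutate(expr[2], rng))
    if r < 0.8:
        return (rng.choice(list(OPS)), expr, random_leaf(rng))
    return random_leaf(rng)

def simplify(expr):
    """Simplify step: fold operators whose arguments are both constants."""
    if expr[0] not in OPS:
        return expr
    left, right = simplify(expr[1]), simplify(expr[2])
    if left[0] == "c" and right[0] == "c":
        return ("c", OPS[expr[0]](left[1], right[1]))
    return (expr[0], left, right)

def constants(expr):
    if expr[0] == "c":
        return [expr[1]]
    if expr[0] == "x":
        return []
    return constants(expr[1]) + constants(expr[2])

def with_constants(expr, values):
    """Rebuild expr, taking replacement constants from the iterator `values`."""
    if expr[0] == "c":
        return ("c", next(values))
    if expr[0] == "x":
        return expr
    left = with_constants(expr[1], values)
    right = with_constants(expr[2], values)
    return (expr[0], left, right)

def optimize(expr, data, steps=30):
    """Optimize step: coordinate search over constants with a shrinking step."""
    params, best, step = constants(expr), loss(expr, data), 1.0
    for _ in range(steps):
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params[:i] + [params[i] + delta] + params[i + 1:]
                trial_loss = loss(with_constants(expr, iter(trial)), data)
                if trial_loss < best:
                    params, best = trial, trial_loss
        step *= 0.5
    return with_constants(expr, iter(params))

def search(data, n_populations=3, pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    score = lambda e: loss(e, data)
    populations = [[random_leaf(rng) for _ in range(pop_size)]
                   for _ in range(n_populations)]
    best = populations[0][0]
    for _ in range(generations):
        for pop in populations:
            for _ in range(pop_size):
                parent = min(rng.sample(pop, 3), key=score)  # tournament
                child = simplify(mutate(parent, rng))
                if size(child) <= 15:  # crude complexity cap
                    pop[rng.randrange(pop_size)] = child
            best_i = min(range(pop_size), key=lambda i: score(pop[i]))
            pop[best_i] = optimize(pop[best_i], data)
        champion = min((e for pop in populations for e in pop), key=score)
        if score(champion) < score(best):
            best = champion
        for pop in populations:  # migration of the best expression so far
            pop[rng.randrange(pop_size)] = best
    return best

if __name__ == "__main__":
    data = [(float(x), 2.0 * x + 1.0) for x in range(-3, 4)]
    found = search(data)
    print(found, loss(found, data))
```

Folding constants before tuning them matters in this sketch because it reduces the number of free parameters the optimizer has to search over; for example, `("+", ("c", 1.0), ("c", 2.0))` collapses to a single constant. A production system would also need a richer operator set, a principled complexity penalty, and a gradient-based constant optimizer, none of which are modeled here.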

