We introduce MESSY estimation, a Maximum-Entropy based Stochastic and Symbolic densitY estimation method. The proposed approach recovers probability density functions symbolically from samples using moments of a gradient flow in which the ansatz serves as the driving force. In particular, we construct a gradient-based drift-diffusion process that connects samples of the unknown distribution function to a guessed symbolic expression. We then show that when the guessed distribution has the maximum entropy form, its parameters can be found efficiently by solving a linear system of equations constructed from the moments of the provided samples. Furthermore, we use symbolic regression to explore the space of smooth functions and find basis functions for the exponent of the maximum entropy functional that lead to a well-conditioned problem. The cost of the proposed method in each iteration of the random search is linear in the number of samples and quadratic in the number of basis functions. We validate the proposed MESSY estimation method against other benchmark methods for a bi-modal density, a discontinuous density, and a density at the limit of physical realizability. We find that adding a symbolic search for basis functions improves the accuracy of the estimation at a reasonable additional computational cost. Our results suggest that the proposed method outperforms existing density recovery methods in the limit of a small to moderate number of samples by providing a low-bias and tractable symbolic description of the unknown density at a reasonable computational cost.
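To make the linear-system step concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes a maximum entropy ansatz of the form p(x) ∝ exp(Σ_j λ_j H_j(x)) whose density decays at the boundaries, so that integration by parts gives Σ_j E[H_i' H_j'] λ_j = -E[H_i'']. The sign convention, the function name `fit_maxent_exponent`, and the choice of basis {x, x²} are assumptions made for illustration; the expectations are replaced by sample averages.

```python
import numpy as np


def fit_maxent_exponent(samples, grad_basis, lap_basis):
    """Solve for lambda in p(x) ∝ exp(sum_j lambda_j H_j(x)) from 1-D samples.

    Hypothetical helper: grad_basis(x) and lap_basis(x) return arrays of
    shape (N, M) holding the first and second derivatives of the M basis
    functions H_j evaluated at the N samples.
    """
    dH = grad_basis(samples)            # (N, M) first derivatives H_j'
    d2H = lap_basis(samples)            # (N, M) second derivatives H_j''
    n = samples.shape[0]
    # Sample estimate of E[H_i' H_j']; assembling it costs O(N * M^2).
    L = dH.T @ dH / n
    # Sample estimate of -E[H_i''] (right-hand side from integration by parts).
    g = -d2H.mean(axis=0)
    return np.linalg.solve(L, g)


# Usage: samples of a standard normal with the assumed basis H = (x, x^2).
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)


def grad(x):
    # Derivatives of (x, x^2) are (1, 2x).
    return np.stack([np.ones_like(x), 2.0 * x], axis=1)


def lap(x):
    # Second derivatives of (x, x^2) are (0, 2).
    return np.stack([np.zeros_like(x), 2.0 * np.ones_like(x)], axis=1)


lam = fit_maxent_exponent(x, grad, lap)
print(lam)  # expected to be close to (0, -0.5), i.e. p(x) ∝ exp(-x^2 / 2)
```

Assembling the matrix takes one pass over the samples with work proportional to the square of the number of basis functions, which is consistent with the linear-in-samples and quadratic-in-basis-functions cost stated in the abstract. The symbolic search over basis functions is not shown here.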