MESSY Estimation: Maximum-Entropy based Stochastic and Symbolic densitY Estimation

We introduce MESSY estimation, a Maximum-Entropy based Stochastic and Symbolic densitY estimation method. The proposed approach recovers probability density functions symbolically from samples using moments of a gradient flow in which the ansatz serves as the driving force. In particular, we construct a gradient-based drift-diffusion process that connects samples of the unknown distribution function to a symbolic guess expression. We then show that when the guess distribution has the maximum-entropy form, its parameters can be found efficiently by solving a linear system of equations constructed from moments of the provided samples. Furthermore, we use symbolic regression to explore the space of smooth functions and find basis functions for the exponent of the maximum-entropy functional that lead to good conditioning. For each set of selected basis functions, the cost of the proposed method is linear in the number of samples and quadratic in the number of basis functions. However, the underlying acceptance/rejection procedure for finding optimal and well-conditioned bases adds to the computational cost. We validate the proposed MESSY estimation method against other benchmark methods on a bi-modal density, a discontinuous density, and a density at the limit of physical realizability. We find that adding a symbolic search for basis functions improves the accuracy of the estimation at a reasonable additional computational cost. Our results suggest that, for a small to moderate number of samples, the proposed method outperforms existing density recovery methods by providing a low-bias and tractable symbolic description of the unknown density at a reasonable computational cost.
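To make the linear-system step concrete, the following is a minimal one-dimensional sketch, not the paper's implementation. It assumes the sign convention p(x) ∝ exp(Σ_i λ_i H_i(x)) and uses the standard integration-by-parts identity E[H_i''] + Σ_j λ_j E[H_i' H_j'] = 0, which holds when boundary terms vanish. Replacing expectations with sample averages gives L λ = g, with L_ij = mean(H_i' H_j') and g_i = -mean(H_i''). The function name `fit_maxent_lambda`, the basis H = (x, x²), and the use of the condition number of L as a conditioning check are all illustrative assumptions.

```python
import numpy as np

def fit_maxent_lambda(x, d_basis, d2_basis):
    """Solve L @ lam = g for p(x) ∝ exp(sum_i lam_i * H_i(x)) in 1D.

    d_basis(x)  -> array (N, Nb) of first derivatives H_i'(x)
    d2_basis(x) -> array (N, Nb) of second derivatives H_i''(x)
    """
    dH = d_basis(x)
    d2H = d2_basis(x)
    n = x.shape[0]
    L = dH.T @ dH / n          # L_ij = mean(H_i' H_j'); cost O(N * Nb^2)
    g = -d2H.mean(axis=0)      # g_i = -mean(H_i'')
    lam = np.linalg.solve(L, g)
    return lam, np.linalg.cond(L)

# Hypothetical basis H = (x, x^2), chosen only for illustration.
def d_basis(x):
    return np.stack([np.ones_like(x), 2.0 * x], axis=1)

def d2_basis(x):
    return np.stack([np.zeros_like(x), 2.0 * np.ones_like(x)], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)   # samples from a standard normal
lam, cond = fit_maxent_lambda(x, d_basis, d2_basis)
print(lam, cond)  # lam should be close to (0, -0.5), i.e. exp(-x^2 / 2)
```

Assembling L takes one pass over the samples and touches every pair of basis functions, which matches the linear-in-samples, quadratic-in-basis cost stated above. A basis search could, under the same assumptions, reject candidate bases whose `cond` value is too large.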
