Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well in big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets: they rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome these shortcomings when training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train such models efficiently. For Maxent models with a discrete probability distribution over $n$ elements built from samples, each containing $m$ features, both the estimation of the stepsize parameters and each iteration of our algorithms require $O(mn)$ operations and can be trivially parallelized. Moreover, the strong convexity of the Kullback-Leibler divergence with respect to the $\ell_{1}$ norm allows for larger stepsize parameters, thereby speeding up the convergence of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the art by one order of magnitude and yield results that agree with physical models of wildfire occurrence and previous statistical analyses of wildfire drivers.
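To make the $O(mn)$ cost concrete, the sketch below evaluates the two quantities that a first-order Maxent solver typically needs at every iteration: the Gibbs-form distribution $p_j \propto q_j \exp(\langle w, a_j \rangle)$ and its feature expectation. This is not the algorithm proposed in the paper, only a generic building block. The function names, the prior `prior`, and the dual weights `w` are hypothetical and chosen for illustration.

```python
import math


def gibbs_distribution(features, prior, w):
    """Return p with p[j] proportional to prior[j] * exp(<w, features[j]>).

    features: n rows, each holding m feature values (illustrative layout).
    prior:    n strictly positive prior weights.
    w:        m dual weights.
    Cost is O(mn): one length-m dot product per element.
    """
    # Log of the unnormalized probabilities.
    scores = [
        math.log(q) + sum(wi * fi for wi, fi in zip(w, row))
        for q, row in zip(prior, features)
    ]
    # Subtract the maximum before exponentiating (log-sum-exp trick)
    # so that large scores do not overflow.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [x / total for x in weights]


def feature_expectation(features, p):
    """Return the m-vector sum_j p[j] * features[j], again in O(mn)."""
    m = len(features[0])
    return [sum(pj * row[i] for pj, row in zip(p, features)) for i in range(m)]


if __name__ == "__main__":
    feats = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # n = 3 elements, m = 2 features
    uniform = [1.0 / 3.0] * 3
    p = gibbs_distribution(feats, uniform, [0.0, 0.0])
    print(p)                              # zero weights reproduce the prior
    print(feature_expectation(feats, p))  # model mean of each feature
```

Each of the $n$ rows is processed independently before the final normalization, which is why this kind of evaluation parallelizes easily across elements.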