For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and in parameter space, respectively. For the Gaussian variational family, previous convergence guarantees for WVI were stronger than existing results for BBVI with the reparametrization gradient, suggesting that the measure-space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's advantage stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. This estimator is usually associated with Price's theorem and uses second-order information (Hessians) of the target log-density; we refer to it as Price's gradient. Conversely, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of the performance improvement.
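To make the distinction between the two estimators concrete, the following is a minimal sketch (not the paper's implementation) of both for a Gaussian variational family q = N(m, CC^T). The target `f(x)` (standing in for the negative log-density) is a hypothetical example; the formulas are the standard reparametrization-trick estimator, which needs only gradients of f, and the Bonnet/Price estimator, whose scale update uses Hessians of f.

```python
# Sketch only: contrasts the reparametrization gradient with Price's gradient
# for a Gaussian variational family q = N(m, C C^T).  `f` is a hypothetical
# placeholder for the negative target log-density.
import jax
import jax.numpy as jnp

def f(x):
    # hypothetical target: -log N(x; 0, I) up to an additive constant
    return 0.5 * jnp.dot(x, x)

def reparam_grads(key, m, C, n_samples=64):
    """Reparametrization estimator: requires only grad f."""
    eps = jax.random.normal(key, (n_samples, m.shape[0]))
    xs = m + eps @ C.T                                 # x = m + C eps
    gs = jax.vmap(jax.grad(f))(xs)                     # grad f(x) per sample
    grad_m = gs.mean(axis=0)                           # E[grad f(x)]
    grad_C = jnp.einsum('ni,nj->ij', gs, eps) / n_samples  # E[grad f(x) eps^T]
    return grad_m, grad_C

def price_grads(key, m, C, n_samples=64):
    """Bonnet/Price estimator: the scale gradient uses Hessians of f."""
    eps = jax.random.normal(key, (n_samples, m.shape[0]))
    xs = m + eps @ C.T
    grad_m = jax.vmap(jax.grad(f))(xs).mean(axis=0)    # Bonnet: E[grad f(x)]
    H = jax.vmap(jax.hessian(f))(xs).mean(axis=0)      # Price: E[hess f(x)]
    grad_C = H @ C                                     # chain rule through Sigma = C C^T
    return grad_m, grad_C

key = jax.random.PRNGKey(0)
m, C = jnp.zeros(2), jnp.eye(2)
print(reparam_grads(key, m, C))
print(price_grads(key, m, C))
```

Both estimators are unbiased for the same gradient of E_q[f] (they agree in expectation by Price's theorem); the sketch only illustrates that the second requires Hessian evaluations while the first does not.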