Stochastic Learning of Non-Conjugate Variational Posterior for Image Classification

Large-scale Bayesian nonparametric (BNP) learners such as stochastic variational inference (SVI) can handle datasets with a large number of classes and a large training set at a fraction of the cost. Like its predecessor, SVI relies on the assumption of a conjugate variational posterior to approximate the true posterior. A more challenging problem is large-scale learning with a non-conjugate posterior. Recent works in this direction mostly use Monte Carlo methods to approximate the learner. However, because of their higher computational complexity, these works are usually demonstrated on non-BNP tasks and on less complex models such as logistic regression. To overcome this limitation of SVI, we develop a novel approach based on the recently proposed variational maximization-maximization (VMM) learner that allows large-scale learning with a non-conjugate posterior. Unlike SVI, our VMM learner does not require closed-form expressions for the variational posterior expectations. Our only requirement is that the variational posterior is differentiable. To ensure convergence in stochastic settings, SVI relies on decaying step-sizes to slow its learning. Inspired by SVI and Adam, we propose the novel use of decaying step-sizes on both the gradient and the ascent direction in our VMM to significantly improve its learning. We show that our proposed method is compatible with ResNet features when applied to datasets with a large number of classes such as MIT67 and SUN397. Finally, we compare our proposed learner with several recent works, such as deep clustering algorithms, and show that it performs on par with or outperforms the state-of-the-art methods in terms of clustering measures.
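
To make the idea of decaying step-sizes concrete, the following is a minimal sketch and not the VMM learner itself. It assumes the Robbins-Monro style schedule rho_t = (t + tau)^(-kappa) commonly used with SVI, and applies the same decaying weight both to a running ascent direction (a momentum-like average, loosely in the spirit of Adam) and to the parameter update. The function names, the toy objective, and the values of tau and kappa are illustrative assumptions, not taken from the paper.

```python
import random


def decaying_step(t, tau=1.0, kappa=0.7):
    """Robbins-Monro style schedule rho_t = (t + tau) ** (-kappa).

    kappa in (0.5, 1] makes the steps sum to infinity while their
    squares stay summable (assumed values, for illustration only).
    """
    return (t + tau) ** (-kappa)


def stochastic_ascent(grad_fn, theta0, num_steps=2000, seed=0):
    """Toy stochastic gradient ascent with decaying steps in two places."""
    rng = random.Random(seed)
    theta = theta0
    direction = 0.0
    for t in range(1, num_steps + 1):
        rho = decaying_step(t)
        g = grad_fn(theta, rng)  # noisy gradient estimate
        # Decaying weight on the running ascent direction.
        direction = (1.0 - rho) * direction + rho * g
        # Decaying step-size on the parameter update.
        theta = theta + rho * direction
    return theta


def noisy_grad(theta, rng):
    """Gradient of the toy objective -(theta - 3)^2 / 2 plus Gaussian noise."""
    return -(theta - 3.0) + rng.gauss(0.0, 1.0)


theta_hat = stochastic_ascent(noisy_grad, theta0=0.0)
print(theta_hat)  # should land near the maximizer at 3.0
```

In this sketch the only property required of the objective is that a (noisy) gradient can be evaluated, which mirrors the differentiability requirement stated above; no closed-form expectation is used.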
