This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization of $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and its KL divergence from $P_{\mathbf{w}}$, which constrains the compression cost to the desired value by construction. We demonstrate that variational training with the Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails, and to compressed weight samples that are more robust to pruning.
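To make the idea of parameterizing by mean and KL concrete, the following is a minimal sketch for a single weight, not the paper's implementation. It assumes univariate Gaussians $Q = \mathcal{N}(\mu_q, \sigma_q^2)$ and $P = \mathcal{N}(\mu_p, \sigma_p^2)$, for which $D_{\mathrm{KL}}[Q \Vert P] = \log(\sigma_p/\sigma_q) + (\sigma_q^2 + (\mu_q - \mu_p)^2)/(2\sigma_p^2) - 1/2$ in nats. Given a mean and a KL budget $\kappa$, it recovers a standard deviation that meets the budget exactly. The function names, the use of bisection, and the choice of the solution with $\sigma_q \le \sigma_p$ are all assumptions made for illustration; the abstract does not specify how the variance is recovered.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] in nats."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )


def sigma_from_mean_kl(mu_q, kappa, mu_p=0.0, sigma_p=1.0, iters=200):
    """Hypothetical helper: find sigma_q <= sigma_p such that KL equals kappa.

    With r = (sigma_q / sigma_p)^2 and delta2 = ((mu_q - mu_p) / sigma_p)^2,
    the KL constraint reduces to r - ln(r) = 2 * kappa + 1 - delta2.
    """
    delta2 = ((mu_q - mu_p) / sigma_p) ** 2
    if delta2 > 2 * kappa:
        # The mean alone already costs more than the KL budget allows.
        raise ValueError("mean is too far from the prior mean for this KL budget")
    c = 2 * kappa + 1 - delta2
    # f(r) = r - ln(r) is decreasing on (0, 1] with f(1) = 1 <= c,
    # so bisection on (0, 1] finds the root with sigma_q <= sigma_p.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) > c:
            lo = mid  # r is too small: f(r) overshoots the target
        else:
            hi = mid
    return sigma_p * math.sqrt(hi)


if __name__ == "__main__":
    sigma = sigma_from_mean_kl(mu_q=0.5, kappa=1.0)
    print(sigma, gaussian_kl(0.5, sigma, 0.0, 1.0))
```

In this sketch the KL budget is met for any admissible mean, so no penalty term needs to be annealed to reach the target. The feasibility check also shows that fixing the KL bounds how far the mean can move from the coding distribution's mean.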