Statistical models are central to machine learning, with broad applicability across a range of downstream tasks. The models are controlled by free parameters that are typically estimated from data by maximum-likelihood estimation or approximations thereof. However, when faced with real-world data sets, many of the models run into a critical issue: they are formulated in terms of fully observed data, whereas in practice the data sets are plagued with missing data. The theory of statistical model estimation from incomplete data is conceptually similar to the estimation of latent-variable models, where powerful tools such as variational inference (VI) exist. However, in contrast to standard latent-variable models, parameter estimation with incomplete data often requires estimating exponentially many conditional distributions of the missing variables, one for each pattern of missingness, which makes standard VI methods intractable. We address this gap by introducing variational Gibbs inference (VGI), a new general-purpose method for estimating the parameters of statistical models from incomplete data. We validate VGI on a set of synthetic and real-world estimation tasks, estimating important machine learning models such as variational autoencoders and normalising flows from incomplete data. The proposed method, whilst general-purpose, achieves competitive or better performance than existing model-specific estimation methods.