Obtaining a reliable estimate of the joint probability mass function (PMF) of a set of random variables from observed data is a significant objective in statistical signal processing and machine learning. Modelling the joint PMF as a tensor that admits a low-rank canonical polyadic decomposition (CPD) has enabled the development of efficient PMF estimation algorithms. However, these algorithms require the rank (model order) of the tensor to be specified beforehand, and in real-world applications the true rank is unknown. An appropriate rank is therefore usually selected from a candidate set, either by observing validation errors or by computing likelihood-based information criteria, a procedure that is computationally expensive for large datasets. This paper presents a novel Bayesian framework for estimating the joint PMF and automatically inferring its rank from observed data. We specify a Bayesian PMF estimation model and employ appropriate prior distributions for the model parameters, allowing for tuning-free rank inference via a single training run. We then derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of the model parameters. Additionally, we develop a scalable version of the VI-based approach by leveraging stochastic variational inference (SVI) to arrive at an efficient algorithm whose complexity scales sublinearly with the size of the dataset. Numerical experiments involving both synthetic data and real movie recommendation data illustrate the advantages of our VI- and SVI-based methods in terms of estimation accuracy, automatic rank detection, and computational efficiency.
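To make the low-rank CPD model of a joint PMF concrete, the following is a minimal sketch in Python with NumPy. It assumes the standard parameterization in which the PMF tensor is a mixture of R rank-one terms, with a weight vector on the probability simplex and one column-stochastic factor matrix per variable. The names `cpd_joint_pmf`, `lam`, `factors`, and `dims` are hypothetical and chosen for illustration; the sketch only assembles the tensor from given parameters and does not implement the VI or SVI inference described in the abstract.

```python
import numpy as np


def cpd_joint_pmf(lam, factors):
    """Assemble a joint PMF tensor from CPD parameters (illustrative only).

    lam     : shape (R,), nonnegative weights summing to 1 (assumed).
    factors : list of arrays, the n-th of shape (I_n, R), where each column
              is a conditional PMF and sums to 1 (assumed).
    """
    rank = lam.shape[0]
    pmf = np.zeros([A.shape[0] for A in factors])
    for r in range(rank):
        # Rank-one term: lam[r] times the outer product of the r-th columns.
        component = np.array(lam[r])
        for A in factors:
            component = np.multiply.outer(component, A[:, r])
        pmf += component
    return pmf


# Hypothetical example: three discrete variables and rank R = 2.
rng = np.random.default_rng(0)
dims, R = (4, 3, 5), 2
lam = rng.dirichlet(np.ones(R))
# dirichlet(..., size=R) has shape (R, I); transpose so columns sum to 1.
factors = [rng.dirichlet(np.ones(I), size=R).T for I in dims]

pmf = cpd_joint_pmf(lam, factors)
print(pmf.shape, pmf.sum())
```

Under these assumptions the resulting tensor is nonnegative and sums to one, and the marginal of the first variable equals `factors[0] @ lam`. A component whose weight in `lam` is zero contributes nothing to the tensor, which is the sense in which the number of active components plays the role of the rank.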