Bayesian multidimensional scaling (BMDS) is a probabilistic dimension-reduction tool for modeling and visualizing data that consist of dissimilarities between pairs of objects. Although BMDS has proven useful in, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require on the order of $N^2$ floating-point operations, where $N$ is the number of data points, so BMDS becomes impractical as $N$ grows large. We propose and compare two sparse versions of BMDS (sBMDS) that restrict the log-likelihood and gradient computations to subsets of the observed dissimilarity matrix. Landmark sBMDS (L-sBMDS) extracts columns of the matrix, while banded sBMDS (B-sBMDS) extracts its diagonals. These sparse variants let one specify a time complexity between $N^2$ and $N$. Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency of all models under both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. With 50 bands (landmarks), the sBMDS likelihoods and gradients yield approximately 3-fold, 10-fold and 40-fold speedups on 500, 1,000 and 5,000 data points, respectively, with negligible loss of accuracy; these speedups only increase with the size of the data considered. Finally, we apply the sBMDS variants to the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks.
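To make the column and diagonal subsetting concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes the commonly used BMDS likelihood in which each observed dissimilarity is Gaussian around the latent Euclidean distance and truncated at zero; the abstract does not state the likelihood, so this form is an assumption. The function names (`full_pairs`, `banded_pairs`, `landmark_pairs`, `log_likelihood`) are hypothetical, and taking the first objects as landmarks is likewise an illustrative choice.

```python
import math


def _log_norm_cdf(x):
    # Log of the standard normal CDF; adequate here because x = d / sigma >= 0.
    return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))


def _pair_term(delta_ij, xi, xj, sigma):
    # Assumed model: delta_ij ~ Normal(d_ij, sigma^2) truncated to [0, inf),
    # where d_ij is the Euclidean distance between latent locations xi and xj.
    d = math.dist(xi, xj)
    resid = delta_ij - d
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - resid ** 2 / (2.0 * sigma ** 2)
            - _log_norm_cdf(d / sigma))


def full_pairs(n):
    # Every pair i < j: N(N-1)/2 terms, i.e. order N^2 work.
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def banded_pairs(n, bands):
    # Banded subset: the first `bands` off-diagonals, i.e. 0 < j - i <= bands.
    return [(i, j) for i in range(n)
            for j in range(i + 1, min(i + bands, n - 1) + 1)]


def landmark_pairs(n, landmarks):
    # Landmark subset: pairs involving one of the first `landmarks` objects
    # (hypothetical landmark choice), each pair counted once.
    return [(l, j) for l in range(min(landmarks, n)) for j in range(l + 1, n)]


def log_likelihood(delta, x, sigma, pairs):
    # Sum the per-pair terms over whichever subset of pairs is supplied.
    return sum(_pair_term(delta[i][j], x[i], x[j], sigma) for i, j in pairs)
```

With `bands` or `landmarks` fixed, both subsets contain roughly `bands * n` pairs, so the cost grows linearly in $N$; setting either to `n - 1` recovers the full set of pairs and the order-$N^2$ cost.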