The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm that integrates information across these feature sets to achieve statistically optimal signal recovery. In practice, such datasets often serve as references that are later queried by new subjects that are only partially observed. Leveraging the asymptotic normality of the estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the performance of both the data integration and the prediction set construction algorithms on synthetic examples and real-world single-cell datasets.