Recent learning-based approaches have achieved impressive results in single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and how to deal with degraded or missing input are less well studied. In particular, we note that previous approaches to deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of naive approaches to feature-space fusion, such as summation or concatenation, which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, that fuses different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets, and the results demonstrate its efficacy. The source code is available at https://github.com/kaichen-z/VMLoc.
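To make the Product-of-Experts idea concrete, the following is a minimal sketch of how per-modality Gaussian posteriors can be combined into one latent distribution, and how a missing modality can simply be left out of the product. It is an illustration under stated assumptions, not the VMLoc implementation: we assume each expert is a diagonal Gaussian given as a (mean, variance) pair, we optionally include a standard-normal prior expert, and the names `product_of_experts`, `rgb_expert`, and `depth_expert` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A diagonal Gaussian expert: (mean, variance), one value per latent dimension.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def product_of_experts(
    experts: Sequence[Optional[Gaussian]],
    latent_dim: int,
    use_prior: bool = True,
) -> Tuple[List[float], List[float]]:
    """Fuse diagonal-Gaussian experts by multiplying their densities.

    The product of Gaussians is Gaussian: precisions (1 / variance) add up,
    and the fused mean is the precision-weighted average of the expert means.
    An expert given as None (a missing modality) is skipped.
    """
    # Assumption: an optional N(0, 1) prior expert with precision 1 and mean 0.
    precision = [1.0 if use_prior else 0.0] * latent_dim
    weighted_mean = [0.0] * latent_dim

    for expert in experts:
        if expert is None:  # modality missing: contributes nothing
            continue
        mean, var = expert
        for d in range(latent_dim):
            precision[d] += 1.0 / var[d]
            weighted_mean[d] += mean[d] / var[d]

    if any(p == 0.0 for p in precision):
        raise ValueError("no expert available to define the fused distribution")

    fused_var = [1.0 / p for p in precision]
    fused_mean = [m * v for m, v in zip(weighted_mean, fused_var)]
    return fused_mean, fused_var


# Hypothetical two-dimensional latent posteriors from an image and a depth encoder.
rgb_expert: Gaussian = ([1.0, 0.0], [0.5, 1.0])
depth_expert: Gaussian = ([3.0, 2.0], [0.5, 1.0])

fused_both = product_of_experts([rgb_expert, depth_expert], latent_dim=2)
fused_rgb_only = product_of_experts([rgb_expert, None], latent_dim=2)
```

Because precisions add, a confident expert (small variance) pulls the fused mean toward itself more strongly than an uncertain one, which is one way such a fusion can account for the differing strengths of each modality, unlike plain summation or concatenation.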