In recent years, many video tasks have achieved breakthroughs by using vision transformers with spatial-temporal decoupling for feature extraction. Although multi-view 3D reconstruction also takes multiple images as input, it cannot directly inherit this success because the associations between unordered views are completely ambiguous: there is no usable prior relationship analogous to the temporal coherence of video. To solve this problem, we propose UMIFormer, a novel transformer network for Unordered Multiple Images. It uses transformer blocks for decoupled intra-view encoding, together with purpose-designed token rectification blocks that mine the correlation between similar tokens from different views to achieve decoupled inter-view encoding. Afterward, all tokens acquired from the various branches are compressed into a fixed-size compact representation that preserves rich information for reconstruction by leveraging the similarities between tokens. Experiments on ShapeNet confirm that our decoupled learning method is well suited to unordered multiple images and that our model outperforms existing state-of-the-art methods by a large margin.
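To make the two similarity-based steps concrete, the following is a minimal NumPy sketch, not the UMIFormer implementation. It assumes each view has already been encoded into N tokens of dimension D by some intra-view encoder. The function names `rectify_tokens` and `compress_tokens` are hypothetical, and the nearest-neighbour averaging and farthest-point seeding used here are simple hand-written stand-ins for the learned blocks described above.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def rectify_tokens(tokens, k=3):
    """Blend each token with its k most similar tokens from OTHER views.

    tokens: array of shape (V, N, D) = (views, tokens per view, channels).
    Illustrative stand-in for inter-view token rectification (assumption).
    """
    V, N, D = tokens.shape
    if V < 2:
        return tokens.copy()  # no other view to borrow information from
    flat = tokens.reshape(V * N, D)
    unit = l2_normalize(flat)
    sim = unit @ unit.T  # pairwise cosine similarity, shape (V*N, V*N)
    view_id = np.repeat(np.arange(V), N)
    same_view = view_id[:, None] == view_id[None, :]
    sim = np.where(same_view, -np.inf, sim)  # exclude same-view tokens
    k = min(k, (V - 1) * N)
    nn_idx = np.argsort(-sim, axis=1)[:, :k]  # indices of the top-k neighbours
    neighbours = flat[nn_idx].mean(axis=1)  # (V*N, D)
    return (0.5 * (flat + neighbours)).reshape(V, N, D)


def compress_tokens(tokens, m=4):
    """Merge all tokens from all views into m tokens by similarity grouping.

    Seeds are picked by greedy farthest-point selection in cosine space,
    then every token is averaged into its most similar seed's group.
    The output shape (m, D) does not depend on the number of views.
    """
    flat = tokens.reshape(-1, tokens.shape[-1])
    unit = l2_normalize(flat)
    seeds = [int(np.argmax(np.linalg.norm(flat, axis=1)))]
    min_dist = 1.0 - unit @ unit[seeds[0]]
    for _ in range(m - 1):
        nxt = int(np.argmax(min_dist))
        seeds.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - unit @ unit[nxt])
    assign = np.argmax(unit @ unit[seeds].T, axis=1)  # group index per token
    merged = []
    for j, seed in enumerate(seeds):
        members = flat[assign == j]
        # Fall back to the seed itself if a group is empty (duplicate tokens).
        merged.append(members.mean(axis=0) if len(members) else flat[seed])
    return np.stack(merged)
```

Calling `compress_tokens(rectify_tokens(tokens))` on an array of shape `(V, N, D)` returns an array of shape `(m, D)` for any number of views `V`. Neither function uses the order of the views, only token similarities, which is the property that matters for unordered inputs.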