Learning-based multi-view stereo (MVS) methods rely heavily on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., a Transformer. Albeit useful, these techniques introduce heavy computation overhead for MVS, since each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point attends only to the corresponding pair of epipolar lines. Our idea takes inspiration from classic epipolar geometry, which shows that one point with different depth hypotheses will be projected onto the epipolar line in the other view. This constraint reduces the 2D search space to the epipolar line in stereo matching. Similarly, it suggests that matching in MVS amounts to distinguishing a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmarks with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet.
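The epipolar property the abstract relies on can be checked numerically. The following is a minimal sketch, not code from the ET-MVSNet repository: the camera intrinsics, relative pose, pixel location, and the helper names `skew`, `project_with_depths`, and `epipolar_line` are all assumptions chosen for illustration. It back-projects one reference pixel at several depth hypotheses, projects the resulting 3D points into a source view, and measures how far each projection lies from the epipolar line given by the fundamental matrix.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def project_with_depths(pixel, depths, K_ref, K_src, R, t):
    """Project one reference pixel into the source view, once per depth hypothesis.

    Assumed convention: X_src = R @ X_ref + t.
    """
    p_h = np.array([pixel[0], pixel[1], 1.0])
    ray = np.linalg.inv(K_ref) @ p_h            # viewing ray with z = 1
    pts_ref = depths[:, None] * ray[None, :]    # (D, 3) points in the reference frame
    pts_src = pts_ref @ R.T + t                 # (D, 3) points in the source frame
    proj = pts_src @ K_src.T                    # homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3]

def epipolar_line(pixel, K_ref, K_src, R, t):
    """Epipolar line (a, b, c) in the source image, scaled so a^2 + b^2 = 1."""
    F = np.linalg.inv(K_src).T @ skew(t) @ R @ np.linalg.inv(K_ref)
    line = F @ np.array([pixel[0], pixel[1], 1.0])
    return line / np.linalg.norm(line[:2])

# Hypothetical cameras: shared intrinsics, small rotation about y, mostly sideways baseline.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.2, 0.0, 0.01])

pixel = (100.0, 150.0)
depths = np.linspace(1.0, 5.0, 8)               # depth hypotheses

pts = project_with_depths(pixel, depths, K, K, R, t)
line = epipolar_line(pixel, K, K, R, t)

# Point-to-line distance in pixels for every depth hypothesis.
residuals = np.abs(pts @ line[:2] + line[2])
print("max distance to epipolar line (px):", residuals.max())
```

Under these assumptions the projections move along the source image as the depth changes, while their distance to the epipolar line stays at the level of floating-point error. This is why candidate matches for one reference pixel only need to be compared along a single line rather than across the whole image.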