Compared with multi-stage self-supervised multi-view stereo (MVS) methods, end-to-end (E2E) approaches have received more attention because of their concise and efficient training pipeline. Recent E2E self-supervised MVS approaches integrate third-party models (such as optical flow, semantic segmentation, and NeRF models) to provide additional consistency constraints, which increases GPU memory consumption and complicates both the model structure and the training pipeline. In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet. To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance. Furthermore, with a novel asymmetric view selection policy and a region-aware depth consistency design, we achieve state-of-the-art performance among E2E self-supervised MVS methods without relying on third-party models for additional consistency signals. Extensive experiments on the DTU and Tanks&Temples benchmarks demonstrate that ES-MVSNet achieves state-of-the-art performance among E2E self-supervised MVS methods and performance competitive with many supervised and multi-stage self-supervised methods.