We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Building on this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our new optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
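To make the dual-branch idea concrete, the following is a minimal, illustrative PyTorch sketch of fusing a stereo correlation cost volume with a volume derived from a monocular depth prior. All function names (`build_corr_volume`, `mono_consistency_volume`, `fuse_volumes`), the Gaussian mono-consistency encoding, and the confidence-weighted blending are assumptions made here for exposition; they are not the paper's actual implementation or fusion mechanism.

```python
# Minimal sketch, assuming a correlation-style stereo volume and a Gaussian
# encoding of the monocular prior; NOT the Stereo Anywhere implementation.
import torch
import torch.nn.functional as F


def build_corr_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume of shape (B, D, H, W), one slice per disparity."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume


def mono_consistency_volume(mono_disp, max_disp, sigma=1.0):
    """Turn a monocular disparity map (B, 1, H, W) into a soft volume that
    peaks at the hypothesis closest to the mono prediction (hypothetical encoding)."""
    hypotheses = torch.arange(max_disp, dtype=torch.float32,
                              device=mono_disp.device).view(1, -1, 1, 1)
    return torch.exp(-((hypotheses - mono_disp) ** 2) / (2 * sigma ** 2))


def fuse_volumes(stereo_vol, mono_vol, confidence):
    """Blend branches with a per-pixel confidence in the stereo evidence."""
    return confidence * stereo_vol + (1.0 - confidence) * mono_vol


if __name__ == "__main__":
    b, c, h, w, max_disp = 1, 32, 64, 128, 48
    feat_l, feat_r = torch.randn(b, c, h, w), torch.randn(b, c, h, w)
    mono_disp = torch.rand(b, 1, h, w) * max_disp   # stand-in for a VFM prediction
    confidence = torch.rand(b, 1, h, w)             # e.g. low on mirrors or textureless areas

    fused = fuse_volumes(build_corr_volume(feat_l, feat_r, max_disp),
                         mono_consistency_volume(mono_disp, max_disp),
                         confidence)
    # Soft-argmax over disparity hypotheses (higher fused score = better match).
    prob = F.softmax(fused, dim=1)
    hypotheses = torch.arange(max_disp, dtype=torch.float32).view(1, -1, 1, 1)
    disparity = (prob * hypotheses).sum(dim=1)
    print(disparity.shape)  # torch.Size([1, 64, 128])
```

The point of the sketch is only the interface: where stereo matching is unreliable (mirrors, transparencies, textureless regions) a learned confidence can down-weight the geometric branch and let the monocular prior dominate the fused volume.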