We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing depth-from-stereo methods process each stereo frame independently, which leads to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR and VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture that estimates disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments. It provides complementary training and evaluation data for dynamic stereo that is closer to real applications than existing datasets. Training on this dataset further improves the prediction quality of both DynamicStereo and prior methods. Finally, the dataset serves as a benchmark for temporally consistent stereo methods.
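To illustrate why divided attention can make video processing cheaper, the following is a minimal NumPy sketch and not the DynamicStereo implementation. It assumes video features arranged as `(T, N, C)` (frames, spatial tokens per frame, channels), uses single-head attention with identity projections instead of learned weights, and applies attention over time first and then over space. The function names `self_attention`, `divided_attention`, and `attention_pairs` are hypothetical and chosen only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head self-attention over the second-to-last axis.
    # Assumption: identity query/key/value projections (no learned weights).
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = (tokens @ np.swapaxes(tokens, -1, -2)) * scale  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                 # (..., L, C)


def divided_attention(video_tokens):
    # video_tokens has shape (T, N, C).
    # Time attention: each spatial location attends across the T frames.
    x = np.swapaxes(video_tokens, 0, 1)   # (N, T, C)
    x = x + self_attention(x)             # residual connection
    x = np.swapaxes(x, 0, 1)              # back to (T, N, C)
    # Space attention: each frame attends across its N spatial tokens.
    x = x + self_attention(x)
    return x


def attention_pairs(T, N):
    # Number of query-key pairs: joint space-time attention vs. divided attention.
    joint = (T * N) ** 2
    divided = T * N * (T + N)
    return joint, divided


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 6, 8))    # toy sizes: T=4, N=6, C=8
    out = divided_attention(feats)
    print(out.shape)
    print(attention_pairs(4, 6))
```

Under these assumptions, joint attention over all `T * N` tokens scores `(T * N)^2` pairs, while the divided variant scores `T * N * (T + N)` pairs, so the gap widens as the clip length or the resolution grows.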