This work addresses the challenge of streamed video depth estimation, which demands not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal for fostering temporal consistency. Thus, instead of developing a depth estimator from scratch, we reformulate this predictive task as a conditional generation problem that provides contextual information both within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context: during training, we sample independent noise levels for each frame within a clip; at inference, we use a sliding window strategy and initialize overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.
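The cross-clip strategy above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1000-level noise schedule, the function names, and the exact overlap bookkeeping are assumptions made for clarity. The key ideas it captures are (1) each frame in a training clip gets its own independently sampled noise level, and (2) at inference, a sliding window reuses previously predicted frames in the overlap region at noise level 0 (i.e., without adding noise), while new frames start from pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_noise_levels(clip_len, num_levels=1000):
    """Training: sample an independent noise level for every frame in a
    clip, rather than one shared level for the whole clip.
    (num_levels=1000 is an assumed diffusion schedule length.)"""
    return rng.integers(0, num_levels, size=clip_len)

def sliding_window_schedule(video_len, clip_len, overlap, max_level=999):
    """Inference: split a long video into overlapping clips. After the
    first clip, the first `overlap` frames of each window are initialized
    from the previous window's predictions at noise level 0 (no noise
    added); the remaining frames start from pure noise (max_level)."""
    stride = clip_len - overlap
    clips, start = [], 0
    while start < video_len:
        end = min(start + clip_len, video_len)
        frames = list(range(start, end))
        # Noise level per frame: 0 for already-predicted overlap frames.
        levels = [0 if (start > 0 and i < start + overlap) else max_level
                  for i in frames]
        clips.append((frames, levels))
        if end == video_len:
            break
        start += stride
    return clips
```

For example, `sliding_window_schedule(10, 4, 2)` yields windows `[0..3]`, `[2..5]`, `[4..7]`, `[6..9]`, where from the second window on the first two frames carry noise level 0 because they are seeded by the preceding window's predictions.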