Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
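The caching idea mentioned above parallels KV caching in LLM decoding: per-frame latent features computed during inference are stored and reused, so each incoming frame requires only a single encoder pass while the temporal head still sees a short history. The snippet below is a minimal sketch of that pattern, assuming a toy per-frame encoder, a fixed-size rolling cache, and hypothetical module names; it is not taken from the oVDA codebase.

```python
# Illustrative sketch only: a rolling latent-feature cache for frame-by-frame
# (online) inference, analogous to KV caching in LLMs. All module names,
# layer choices, and the cache size are hypothetical stand-ins.
from collections import deque
import torch
import torch.nn as nn

class CachedFrameEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64, max_cached_frames: int = 8):
        super().__init__()
        # Stand-in per-frame encoder and temporal depth head.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.temporal = nn.Conv3d(feat_dim, 1, kernel_size=3, padding=1)
        # Rolling cache of latent features from the most recent frames.
        self.cache = deque(maxlen=max_cached_frames)

    @torch.no_grad()
    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # Encode only the newest frame; past features come from the cache.
        feat = self.encoder(frame.unsqueeze(0))        # (1, C, H, W)
        self.cache.append(feat)
        clip = torch.stack(list(self.cache), dim=2)    # (1, C, T, H, W)
        depth = self.temporal(clip)[:, :, -1]          # depth for current frame
        return depth.squeeze(0)                        # (1, H, W)

model = CachedFrameEncoder()
for t in range(5):                                     # simulate a video stream
    frame = torch.rand(3, 64, 64)                      # one incoming RGB frame
    print(t, model(frame).shape)                       # torch.Size([1, 64, 64])
```

The point of the sketch is the deployment trade-off stated in the abstract: the cache bounds per-frame compute and VRAM to a fixed window, which is what makes low-latency inference on edge hardware feasible.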