A Practical Stereo Depth System for Smart Glasses

Jialiang Wang, Daniel Scharstein, Akash Bapat, Kevin Blackburn-Matzen, Matthew Yu, Jonathan Lehman, Suhib Alsisan, Yanghan Wang, Sam Tsai, Jan-Michael Frahm, Zijian He, Peter Vajda, Michael F. Cohen, Matt Uyttendaele

From arXiv; accepted at CVPR 2023

We present the design of a productionized end-to-end stereo depth sensing system that performs pre-processing, online stereo rectification, and stereo depth estimation, with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects from point-of-view images captured by smart glasses. All these steps are executed on-device within the stringent compute budget of a mobile phone. Because we expect users to have a wide range of smartphones, our design needs to be general and cannot depend on particular hardware or an ML accelerator such as a smartphone GPU. Although each of these steps is well studied, a description of a practical system is still lacking. In such a system, all these steps need to work in tandem with one another and fall back gracefully on failures within the system or on less-than-ideal input data. We show how we handle unforeseen changes to calibration (e.g., due to heat), robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. We show that our trained models are fast, running in less than 1 s on the CPU of a six-year-old Samsung Galaxy S8 phone. Our models generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured from the smart glasses.
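
The fallback behavior described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `RectificationResult`, `estimate_depth`, and the `reliable` flag are hypothetical, and the abstract does not say how rectification reliability is judged. The rectifier and the two depth estimators are passed in as plain callables so that the sketch shows only the control flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Placeholder types for this sketch; a real system would use image arrays.
Image = List[List[float]]
Depth = List[List[float]]


@dataclass
class RectificationResult:
    left: Image
    right: Image
    # Hypothetical flag: the abstract does not specify the reliability test.
    reliable: bool


def estimate_depth(
    left: Image,
    right: Image,
    rectify: Callable[[Image, Image], RectificationResult],
    stereo_depth: Callable[[Image, Image], Depth],
    mono_depth: Callable[[Image], Depth],
) -> Tuple[Depth, str]:
    """Return a depth map and the name of the path that produced it."""
    try:
        result = rectify(left, right)
    except Exception:
        # Assumption: a rectifier that raises is treated like an
        # unreliable rectification, so the monocular path is used.
        return mono_depth(left), "mono"

    if result.reliable:
        # Normal path: stereo depth on the rectified pair.
        return stereo_depth(result.left, result.right), "stereo"

    # Fallback path: monocular depth from the left image only.
    return mono_depth(left), "mono"
```

Returning the name of the path alongside the depth map is a design choice of this sketch; it lets a caller log or adapt to how often the monocular fallback is taken.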
