Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses

Stereo depth estimation is a fundamental component of augmented reality (AR) applications. Although AR applications require very low latency to run in real time, traditional depth estimation models often rely on time-consuming preprocessing steps such as rectification to achieve high accuracy. In addition, algorithms built on non-standard ML operators, such as the cost volume, incur significant latency, which is aggravated on compute-constrained mobile platforms. We therefore develop hardware-friendly alternatives to the costly cost volume and preprocessing, and design two new models based on them: MultiHeadDepth and HomoDepth. We replace the cost volume with a new group-pointwise convolution-based operator and an approximation of cosine similarity based on layer normalization and dot products. For online stereo rectification (preprocessing), we introduce a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images and thereby eliminates the need for preprocessing. MultiHeadDepth, which includes the optimized cost volume, provides 11.8-30.3% improvements in accuracy and a 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. HomoDepth, which adds the optimized preprocessing (homography + RPE) to MultiHeadDepth, can process unrectified images and reduces end-to-end latency by 44.5%. We adopt a multi-task learning framework to handle misaligned stereo inputs in HomoDepth, which reduces the AbsRel error by 10.0-24.3%. The results demonstrate the efficacy of our approaches in achieving both high model performance and low latency, a step toward practical depth estimation on future AR devices.
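The cosine-similarity approximation mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following assumes layer normalization over the channel axis without learnable affine parameters, followed by a dot product scaled by the channel count. The function names (layer_norm, approx_cosine, exact_cosine) and the 64-channel random features are hypothetical and not taken from the paper. Under these assumptions the result equals the Pearson correlation of the two feature vectors, which is close to the cosine similarity when the features have near-zero mean.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Zero-mean, unit-variance normalization over one feature vector (no affine terms)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale for v in vec]


def approx_cosine(a, b):
    """Assumed approximation: dot product of layer-normalized vectors, divided by their length."""
    na, nb = layer_norm(a), layer_norm(b)
    return sum(x * y for x, y in zip(na, nb)) / len(a)


def exact_cosine(a, b):
    """Reference cosine similarity, which needs two vector norms and a division."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 64-channel features for one left pixel and a slightly perturbed right pixel.
random.seed(0)
left = [random.gauss(0.0, 1.0) for _ in range(64)]
right = [v + 0.1 * random.gauss(0.0, 1.0) for v in left]

approx = approx_cosine(left, right)
exact = exact_cosine(left, right)
print(f"approx={approx:.4f} exact={exact:.4f}")
```

The appeal of this form on mobile accelerators is that layer normalization and dot products are standard operators, whereas an explicit per-pair norm and division may not map as well to the hardware. Whether this matches the operator used in MultiHeadDepth is an assumption.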
