In this study, we address key challenges in the accuracy and effectiveness of depth estimation for endoscopic imaging, with particular emphasis on real-time inference and the impact of light reflections. We propose EndoDepthL, a novel lightweight solution that integrates Convolutional Neural Networks (CNNs) and Transformers to predict multi-scale depth maps. Our approach optimizes the network architecture and incorporates multi-scale dilated convolution and a multi-channel attention mechanism. We also introduce a statistical confidence boundary mask to minimize the impact of reflective areas. To better evaluate monocular depth estimation in endoscopic imaging, we propose a novel complexity evaluation metric that accounts for network parameter size, floating-point operations, and inference frames per second. We comprehensively evaluate the proposed method and compare it with existing baseline solutions. The results demonstrate that EndoDepthL maintains depth estimation accuracy while retaining a lightweight structure.
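The abstract does not define the statistical confidence boundary mask, so the following is only a minimal sketch of one plausible reading: pixels brighter than an upper bound of mean plus `k` standard deviations are treated as reflections and excluded from an error term. The thresholding rule, the parameter `k`, and the function names `confidence_boundary_mask` and `masked_mean_abs_error` are all illustrative assumptions, not the authors' formulation.

```python
from statistics import mean, pstdev


def confidence_boundary_mask(image, k=2.0):
    """Return a 0/1 mask that is 0 where a pixel exceeds mean + k * std.

    `image` is a list of rows of grayscale intensities. The rule and the
    parameter `k` are assumptions made for illustration only.
    """
    pixels = [p for row in image for p in row]
    upper = mean(pixels) + k * pstdev(pixels)  # assumed upper confidence boundary
    return [[1 if p <= upper else 0 for p in row] for row in image]


def masked_mean_abs_error(pred, target, mask):
    """Mean absolute error over the pixels that the mask keeps (value 1)."""
    kept = [
        abs(p - t)
        for pred_row, target_row, mask_row in zip(pred, target, mask)
        for p, t, m in zip(pred_row, target_row, mask_row)
        if m
    ]
    return sum(kept) / len(kept) if kept else 0.0


# A 3x3 frame with one saturated (specular) pixel in the centre.
frame = [[10, 12, 11], [9, 255, 10], [11, 10, 12]]
mask = confidence_boundary_mask(frame, k=2.0)
```

In this sketch the saturated centre pixel falls outside the boundary and is masked out, so an error at that location would not contribute to `masked_mean_abs_error`. A smaller `k` masks more aggressively, at the cost of discarding bright but valid tissue.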