Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from the focus/defocus cues in an image stack. Most existing methods apply convolutional neural networks (CNNs) with 2D or 3D convolutions to a fixed set of stack images to learn features across images and stacks. Their performance is limited by the local receptive fields of CNNs, and they can only process a fixed stack length that must match between training and inference, which prevents generalization to stacks of arbitrary length. To address these limitations, we develop a novel Transformer-based network, FocDepthFormer, composed mainly of a Transformer with an LSTM module and a CNN decoder. Self-attention in the Transformer enables learning more informative features through implicit non-local cross-referencing. The LSTM module learns to integrate the representations across a stack containing an arbitrary number of images. To directly capture low-level features at various degrees of focus/defocus, we propose using multi-scale convolutional kernels in an early-stage encoder. Thanks to the LSTM-based design, FocDepthFormer can be pre-trained on abundant monocular RGB depth estimation data to learn visual patterns, alleviating the demand for hard-to-collect focal stack data. Extensive experiments on various focal stack benchmark datasets show that our model outperforms state-of-the-art models on multiple metrics.
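The role of the recurrent module described above, folding a stack of any length into a fixed-size representation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `TinyLSTMFusion`, the feature and hidden dimensions, and the random untrained weights are all assumptions made for illustration, and the per-image feature vectors are assumed to come from an upstream encoder that is not shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyLSTMFusion:
    """Hypothetical stand-in for an LSTM fusion stage (illustrative only)."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix covering the four gates: input, forget, cell, output.
        self.W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, feat_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        # Standard LSTM cell update for one image's feature vector.
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def fuse(self, stack_features):
        # stack_features has shape (num_images, feat_dim); num_images may vary.
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for x in stack_features:
            h, c = self.step(x, h, c)
        return h  # fixed-size summary regardless of stack length


fusion = TinyLSTMFusion(feat_dim=8, hidden_dim=16)
rng = np.random.default_rng(1)
short_stack = rng.normal(size=(3, 8))   # a stack of 3 images (assumed features)
long_stack = rng.normal(size=(10, 8))   # a stack of 10 images
print(fusion.fuse(short_stack).shape)   # (16,)
print(fusion.fuse(long_stack).shape)    # (16,)
```

Because the same cell is applied once per image, the output size depends only on the hidden dimension, which is why a recurrent stage of this kind places no constraint on the number of images in the stack.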