FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus

Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from the focus/defocus cues in an image stack. Most existing methods apply convolutional neural networks (CNNs) with 2D or 3D convolutions over a fixed set of stack images to learn features across images and stacks. Their performance is limited by the local nature of CNNs, and they must process the same fixed number of images per stack at training and inference time, which prevents generalization to stacks of arbitrary length. To address these limitations, we develop a novel Transformer-based network, FocDepthFormer, composed mainly of a Transformer with an LSTM module and a CNN decoder. Self-attention in the Transformer learns more informative features through implicit non-local cross-referencing. The LSTM module learns to integrate the representations across a stack containing an arbitrary number of images. To directly capture low-level features at various degrees of focus/defocus, we propose multi-scale convolutional kernels in an early-stage encoder. Thanks to the LSTM-based design, FocDepthFormer can be pre-trained on abundant monocular RGB depth estimation data to capture visual patterns, alleviating the demand for hard-to-collect focal stack data. Extensive experiments on various focal stack benchmark datasets show that our model outperforms state-of-the-art models on multiple metrics.
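The claim that a recurrent module can integrate a stack of arbitrary length can be illustrated with a small sketch. The code below is not the FocDepthFormer implementation: it is a toy, standard-library-only LSTM cell with random, untrained weights, and the names `TinyLSTMCell` and `fuse_stack`, as well as the feature and hidden sizes, are hypothetical. Each inner list stands in for the feature vector that an encoder would produce for one image of the focal stack. Because the same cell weights are reused at every step, the fused state has the same size for any stack length.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Toy LSTM cell; weights are random and untrained (illustration only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        n_in = input_dim + hidden_dim
        # One weight matrix and bias per gate:
        # i = input gate, f = forget gate, g = candidate state, o = output gate.
        self.W = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(hidden_dim)]
            for gate in "ifgo"
        }
        self.b = {gate: [0.0] * hidden_dim for gate in "ifgo"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = list(x) + list(h)  # concatenate current input and previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def fuse_stack(cell, stack_features):
    """Fold per-image feature vectors into one fixed-size hidden state."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    for x in stack_features:  # one step per image in the focal stack
        h, c = cell.step(x, h, c)
    return h


if __name__ == "__main__":
    cell = TinyLSTMCell(input_dim=4, hidden_dim=3)
    rng = random.Random(1)
    for n_images in (2, 5, 10):  # stacks of different lengths
        stack = [[rng.random() for _ in range(4)] for _ in range(n_images)]
        print(n_images, len(fuse_stack(cell, stack)))
```

The same `cell` handles stacks of 2, 5, or 10 images without any change to its parameters, which is the property the abstract contrasts with CNN approaches tied to a fixed stack size. In a real model the per-image vectors would come from a learned encoder and the cell would be trained end to end.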


