Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

CNNs and Transformers each have their own advantages, and both are widely used for dense prediction in multi-task learning (MTL). Most current MTL studies rely solely on either CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution, owing to its powerful and flexible task-specific learning and its lower cost, lower complexity, and fewer parameters compared with traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is carefully designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator that applies deformable convolution to efficiently sample more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current competitive Transformer-based and CNN-based models on a variety of metrics across three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
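
The data flow described in the abstract can be illustrated with a small sketch. It is not the released DeMTG implementation. It is a minimal toy in plain Python lists, written under several assumptions: the function names (`channel_mix`, `task_interaction`, `gated_task_query`) are hypothetical, the gate is assumed to be a sigmoid of a linear map of the query, per-task tokens are assumed to come from a per-task linear projection, and the spatial-aware deformable operator is omitted entirely. The sketch only shows how channel mixing, self-attention across tasks, and a gated task query could fit together.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def channel_mix(tokens, w_channel):
    """Channel-aware mixing: one linear map over channels, shared by all locations."""
    return matmul(tokens, w_channel)


def task_interaction(all_task_tokens):
    """Self-attention over the concatenated tokens of every task, plus a residual."""
    mixed, _ = attention(all_task_tokens, all_task_tokens, all_task_tokens)
    return [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(all_task_tokens, mixed)]


def gated_task_query(task_tokens, interaction, w_gate):
    """One task's tokens query the interaction features; a gate scales the result.

    Assumption: the gate is sigmoid(task_tokens @ w_gate), applied per channel.
    """
    attended, weights = attention(task_tokens, interaction, interaction)
    gate = [[1.0 / (1.0 + math.exp(-g)) for g in row] for row in matmul(task_tokens, w_gate)]
    gated = [[g * a for g, a in zip(g_row, a_row)] for g_row, a_row in zip(gate, attended)]
    return gated, weights


def random_matrix(rows, cols, rng):
    """Helper for the demo: a rows x cols matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    channels, tokens_per_task, num_tasks = 4, 3, 2
    encoder_tokens = random_matrix(tokens_per_task, channels, rng)
    shared = channel_mix(encoder_tokens, random_matrix(channels, channels, rng))
    # Hypothetical per-task projections derive each task's tokens from the shared ones.
    per_task = [channel_mix(shared, random_matrix(channels, channels, rng))
                for _ in range(num_tasks)]
    interaction = task_interaction([row for task in per_task for row in task])
    for t, task_tokens in enumerate(per_task):
        out, _ = gated_task_query(task_tokens, interaction,
                                  random_matrix(channels, channels, rng))
        print(f"task {t}: {len(out)} tokens x {len(out[0])} channels")
```

Plain lists are used here only so that the sketch runs without dependencies. A practical version would use a tensor library and a real deformable convolution for the spatial-aware operator.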
