Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

CNNs and Transformers each have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Most current MTL studies rely solely on either CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and its advantages of lower cost, lower complexity, and fewer parameters than traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator that applies deformable convolution to efficiently sample more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
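
To make the encoder and decoder operators described above more concrete, the following is a minimal pure-Python sketch, not the DeMTG implementation. All function names (`channel_mix`, `deformable_sample`, `gated_task_query`), the toy feature map, and the weights are hypothetical. Two simplifications are assumed: each pixel samples a single offset location with a hand-picked offset (a trained model would predict several offsets per location), and the gate is taken to be a sigmoid of a linear map of the attended feature, which the abstract does not specify.

```python
import math

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def channel_mix(tokens, weight):
    """Channel-aware mixing: linearly mix the C channels of every token (N x C by C x C)."""
    return matmul(tokens, weight)

def deformable_sample(fmap, y, x, dy, dx):
    """Spatial-aware sampling: bilinearly read fmap[H][W][C] at (y + dy, x + dx).

    Assumption: the offsets (dy, dx) are plain arguments here; in a trained
    model they would be predicted from the features.
    """
    h, w = len(fmap), len(fmap[0])
    py = min(max(y + dy, 0.0), h - 1.0)  # clamp to the feature-map border
    px = min(max(x + dx, 0.0), w - 1.0)
    y0, x0 = int(py), int(px)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = py - y0, px - x0
    return [
        (1 - wy) * ((1 - wx) * fmap[y0][x0][c] + wx * fmap[y0][x1][c])
        + wy * ((1 - wx) * fmap[y1][x0][c] + wx * fmap[y1][x1][c])
        for c in range(len(fmap[0][0]))
    ]

def gated_task_query(queries, tokens, gate_weight):
    """Each task query attends over all tokens; a sigmoid gate then rescales the result.

    Assumption: the gate is sigmoid(attended @ gate_weight), chosen for illustration.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    keys_t = [list(col) for col in zip(*tokens)]            # C x N
    scores = matmul(queries, keys_t)                        # T x N
    attn = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(attn, tokens)                         # T x C
    gates = [[1.0 / (1.0 + math.exp(-g)) for g in row]
             for row in matmul(attended, gate_weight)]      # T x C, values in (0, 1)
    gated = [[g * a for g, a in zip(grow, arow)]
             for grow, arow in zip(gates, attended)]
    return attn, attended, gated

# Toy 2x2 feature map with 2 channels.
fmap = [[[0.0, 1.0], [2.0, 3.0]],
        [[4.0, 5.0], [6.0, 7.0]]]
# One deformed sample per pixel, using a fixed half-pixel shift for illustration.
tokens = [deformable_sample(fmap, y, x, 0.5, 0.5) for y in range(2) for x in range(2)]
tokens = channel_mix(tokens, [[1.0, 0.0], [0.0, 1.0]])      # identity mixing weight
task_queries = [[1.0, 0.0], [0.0, 1.0]]                     # one query per task
attn, attended, gated = gated_task_query(
    task_queries, tokens, [[0.5, 0.0], [0.0, 0.5]])
```

In this sketch each task owns one query vector, so adding a task adds one row to `task_queries` while the encoder output `tokens` stays shared across tasks. The gate values lie strictly between 0 and 1, so gating can only attenuate channels of the attended feature, which is one simple way to read "selecting" task-specific features.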
