Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

CNNs and Transformers each have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Most current MTL studies rely solely on either CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and its lower cost, lower complexity, and fewer parameters compared with traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), an up-to-date, simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is carefully designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator that applies deformable convolution to efficiently sample more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current competitive Transformer-based and CNN-based models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
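
To make the data flow described above concrete, the following is a minimal NumPy sketch of the four ingredients named in the abstract: channel-aware mixing, a spatial-aware deformable step, a task interaction block with self-attention, and a gated task query block. It is an illustration under stated assumptions, not the released DeMTG implementation. All function names, the single-point bilinear sampling (a simplification of full deformable convolution), the random stand-ins for learned weights and offsets, and the sigmoid form of the gate are hypothetical choices made for this sketch.

```python
import numpy as np

def channel_mix(x, w):
    """Channel-aware mixing: a pointwise linear map across channels.
    x: (H, W, C), w: (C, C_out)."""
    return x @ w

def deformable_sample(x, offsets):
    """Simplified spatial-aware deformable step (assumption: one sampling
    point per location instead of a full deformable convolution kernel).
    Each location reads x at its own position plus an offset (dy, dx),
    using bilinear interpolation with border clamping.
    x: (H, W, C), offsets: (H, W, 2)."""
    H, W, _ = x.shape
    gy, gx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ys = np.clip(gy + offsets[..., 0], 0, H - 1)
    xs = np.clip(gx + offsets[..., 1], 0, W - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (ys - y0)[..., None], (xs - x0)[..., None]
    top = (1 - wx) * x[y0, x0] + wx * x[y0, x1]
    bottom = (1 - wx) * x[y1, x0] + wx * x[y1, x1]
    return (1 - wy) * top + wy * bottom

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def task_interaction(task_feats):
    """Concatenate the tokens of all tasks and apply self-attention,
    so every token can attend to every task.
    task_feats: list of T arrays of shape (N, C); returns (T*N, C)."""
    tokens = np.concatenate(task_feats, axis=0)
    return attention(tokens, tokens, tokens)

def gated_task_query(task_feat, interaction, w_gate):
    """A task feature queries the shared interaction tokens; a sigmoid
    gate (assumed form) blends the attended result with the query."""
    attended = attention(task_feat, interaction, interaction)
    gate = 1.0 / (1.0 + np.exp(-(attended @ w_gate)))
    return gate * attended + (1.0 - gate) * task_feat

# Toy forward pass; random arrays stand in for learned parameters.
rng = np.random.default_rng(0)
H, W, C, T = 4, 4, 8, 2
x = rng.standard_normal((H, W, C))
mixed = channel_mix(x, rng.standard_normal((C, C)))
deformed = deformable_sample(mixed, 0.5 * rng.standard_normal((H, W, 2)))
tokens = deformed.reshape(H * W, C)
task_feats = [tokens @ rng.standard_normal((C, C)) for _ in range(T)]
interaction = task_interaction(task_feats)
outputs = [gated_task_query(f, interaction, rng.standard_normal((C, C)))
           for f in task_feats]
```

In this sketch the encoder half (`channel_mix`, `deformable_sample`) produces one shared feature map, and the decoder half turns it into one output per task. With all offsets set to zero, `deformable_sample` reduces to the identity, which is a convenient sanity check when experimenting with the sampling step.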
