Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

CNNs and Transformers each have distinct advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Most current MTL studies rely solely on either CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution, owing to its powerful and flexible task-specific learning and its lower cost, lower complexity, and fewer parameters than traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator that applies deformable convolution to efficiently sample more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current competitive Transformer-based and CNN-based models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
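
The two mechanisms named above, sampling at deformable spatial locations and selecting task-specific features with a gated query, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a single-channel, plain-Python toy under stated assumptions. The function names `bilinear_sample`, `deformable_conv_at`, and `gated_task_query` are hypothetical, the offsets are passed in by hand where a real model would predict them, and the per-channel sigmoid gate is only one plausible form of gating.

```python
import math

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at fractional (y, x).
    Locations outside the map contribute zero."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                value += wy * wx * fmap[yi][xi]
    return value

def deformable_conv_at(fmap, kernel, offsets, y, x):
    """Single-channel 3x3 deformable convolution at one output position.
    kernel: 3x3 weights. offsets: 3x3 grid of (dy, dx) pairs; supplied by
    hand here (assumption), predicted by the network in a real model."""
    out = 0.0
    for i, ky in enumerate((-1, 0, 1)):
        for j, kx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear_sample(fmap, y + ky + dy, x + kx + dx)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_task_query(query, tokens, gate_weights):
    """One task query attends over feature tokens (scaled dot-product attention),
    then a sigmoid gate rescales each channel of the attended vector.
    The per-channel gate is a hypothetical simplification."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    attn = softmax(scores)
    attended = [sum(a * tok[c] for a, tok in zip(attn, tokens)) for c in range(d)]
    gate = [1.0 / (1.0 + math.exp(-gw * v)) for gw, v in zip(gate_weights, attended)]
    return [g * v for g, v in zip(gate, attended)], attn

# Usage: a 4x4 feature map with values 0..15 and a kernel that keeps only the center tap.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
center_only = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
zero_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
shifted = [row[:] for row in zero_offsets]
shifted[1][1] = (0.5, 0.5)  # move the center tap half a pixel down and right

plain = deformable_conv_at(fmap, center_only, zero_offsets, 1, 1)   # ordinary convolution
deformed = deformable_conv_at(fmap, center_only, shifted, 1, 1)     # samples between pixels

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
task_feature, attn = gated_task_query([2.0, 0.0], tokens, gate_weights=[1.0, 1.0])
```

With all offsets at zero the operator reduces to an ordinary 3x3 convolution; non-zero offsets let each kernel tap read from a fractional location, which is what allows the sampling grid to follow informative regions. In the gated query, the attention weights decide which tokens a task looks at, and the gate then scales each attended channel by a factor between 0 and 1.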
