Vision transformers have shown strong performance on single tasks such as classification and segmentation. However, real-world problems are not isolated, which calls for vision transformers that can perform multiple tasks concurrently. Existing multi-task vision transformers are handcrafted and rely heavily on human expertise. In this work, we propose a novel one-shot neural architecture search framework, dubbed AutoTaskFormer (Automated Multi-Task Vision TransFormer), to automate this design process. AutoTaskFormer not only automatically identifies which weights to share across multiple tasks, but also provides thousands of well-trained vision transformers with a wide range of parameters (e.g., number of heads and network depth) for deployment under various resource constraints. Experiments on both small-scale (2-task Cityscapes and 3-task NYUv2) and large-scale (16-task Taskonomy) datasets show that AutoTaskFormer outperforms state-of-the-art handcrafted vision transformers in multi-task learning. All code and models will be open-sourced.