An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training

We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set with multiple task labels. Such multi-label datasets are rare, small, and expensive. We use the term heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets. Few studies have explored training on such heterogeneous datasets. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing large intrinsic differences among vision tasks, including data distributions, architectures, task-specific modules, dataset scales, and sampling strategies. To address these challenges, we propose to modify and scale up mixture-of-experts (MoE) vision transformers so that they can simultaneously learn classification, detection, and segmentation on diverse mainstream vision datasets, including ImageNet, COCO, and ADE20K. Our approach achieves results comparable to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks. Owing to its emergent modularity, this general-purpose model decomposes into high-performing components that adapt efficiently to downstream tasks. We can fine-tune it with fewer trainable parameters, fewer model parameters, and less computation. Additionally, its modularity allows for easy expansion in continual-learning-without-forgetting scenarios. Finally, these functions can be controlled and combined to meet various demands of downstream tasks.
