Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, because different training paradigms give each model its own inherent representation biases, VFMs exhibit distinct strengths and weaknesses across vision tasks. Although combining the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by pairing lightweight Teacher-Specific Adapter Path modules with a Teacher-Agnostic Stem. By dynamically selecting and combining representations with Mixture-of-Representations Routers, SAK synergizes the complementary strengths of multiple VFMs. Extensive experiments show that SAK outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs.
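To make the data flow described above concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a Teacher-Agnostic Stem modeled as a single linear map, one residual linear adapter per teacher standing in for a Teacher-Specific Adapter Path, and one softmax gate per task standing in for a Mixture-of-Representations Router. The class name `ToySAK`, the teacher and task labels, the dimensions, and the random weights are all hypothetical, and the distillation losses against the teacher VFMs are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToySAK:
    """Hypothetical toy model: shared stem, per-teacher adapters, per-task routers."""

    def __init__(self, dim, teachers, tasks, seed=0):
        rng = random.Random(seed)
        self.teachers = list(teachers)
        # Assumption: the teacher-agnostic stem is one shared linear map.
        self.stem = random_matrix(dim, dim, rng)
        # Assumption: each teacher-specific path is a residual linear adapter.
        self.adapters = {t: random_matrix(dim, dim, rng) for t in self.teachers}
        # Assumption: each task has a router scoring every teacher path.
        self.routers = {
            task: random_matrix(len(self.teachers), dim, rng) for task in tasks
        }

    def forward(self, x, task):
        shared = linear(self.stem, x)
        # One representation per teacher: shared features plus adapter output.
        reps = []
        for t in self.teachers:
            delta = linear(self.adapters[t], shared)
            reps.append([s + d for s, d in zip(shared, delta)])
        # The router turns the shared features into one weight per teacher.
        gates = softmax(linear(self.routers[task], shared))
        # Task-specific feature: gate-weighted sum of teacher representations.
        mixed = [
            sum(g * rep[i] for g, rep in zip(gates, reps))
            for i in range(len(shared))
        ]
        return mixed, gates
```

For example, `ToySAK(dim=4, teachers=["teacher_a", "teacher_b"], tasks=["task_a"]).forward([1.0, 0.5, -0.5, 2.0], "task_a")` returns a 4-dimensional mixed feature together with two gate weights that sum to one. Because each task has its own router, different tasks can weight the same teacher paths differently, which is the intuition behind dynamic selection and combination of representations.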