Constructing a universal detector poses a crucial question: how can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge, all within a single model. Previous methods achieve this by attaching separate detection heads to a common backbone, but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts (MoE) as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts (DAMEX), in which we train each expert to become an "expert" of a dataset by learning to route each dataset's tokens to its mapped expert. Experiments on the Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by an average of +10.2 AP and improve over our non-MoE baseline by an average of +2.0 AP. We also observe consistent gains when mixing datasets with (1) limited availability, (2) disparate domains, and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse.
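
The dataset-aware routing idea described in the abstract can be illustrated with a small sketch. The abstract does not specify the training objective, so the code below assumes a cross-entropy loss between each token's router distribution and the expert mapped to that token's source dataset, followed by top-1 routing. The names `DATASET_TO_EXPERT`, `dataset_aware_router_loss`, and `route`, as well as the dataset names, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

# Hypothetical dataset-to-expert mapping (an assumption, not from the paper).
DATASET_TO_EXPERT = {"dataset_a": 0, "dataset_b": 1, "dataset_c": 2}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dataset_aware_router_loss(router_logits, dataset_name):
    """Mean cross-entropy between each token's router distribution and the
    expert mapped to the token's source dataset (assumed objective)."""
    target = DATASET_TO_EXPERT[dataset_name]
    losses = [-math.log(softmax(token_logits)[target])
              for token_logits in router_logits]
    return sum(losses) / len(losses)


def route(router_logits):
    """Top-1 routing: index of the highest-scoring expert for each token."""
    return [max(range(len(t)), key=t.__getitem__) for t in router_logits]


# Two tokens, three experts; both tokens currently favor expert 0.
logits = [[2.0, 0.1, -1.0], [1.5, 0.3, 0.0]]
loss_a = dataset_aware_router_loss(logits, "dataset_a")  # target expert 0
loss_b = dataset_aware_router_loss(logits, "dataset_b")  # target expert 1
assignments = route(logits)
```

Under this assumed objective, the loss is lower when the router already prefers the mapped expert (`loss_a < loss_b` here), so minimizing it pushes tokens from each dataset toward their designated expert. At inference time only `route` is needed, since no dataset label is required to pick the top-scoring expert.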