Interest has grown in recent years in building a general-purpose computer vision system: one that can solve a wide range of vision tasks simultaneously without being restricted to a specific problem or data domain, a property that is crucial for practical, real-world applications. In this study, we focus on million-scale multi-domain universal object detection, which poses several challenges, including duplicated category labels across datasets, label conflicts, and hierarchical taxonomies. A further open challenge is finding a resource-efficient way to leverage large pre-trained vision models for million-scale cross-dataset object detection. To address these challenges, we present our approaches to label handling, hierarchy-aware loss design, and resource-efficient training with a large pre-trained model. Our method ranked second in the object detection track of the Robust Vision Challenge 2022 (RVC 2022). We hope this detailed study will serve as a useful reference and an alternative approach for similar problems in the computer vision community. The code is available at https://github.com/linfeng93/Large-UniDet.