Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-world datasets, where many tail classes have scarce instances. One popular strategy is to exploit extra data with image-level labels, yet it yields limited gains for two reasons: (1) semantic ambiguity: an image-level label captures only a salient part of the image and ignores the remaining rich semantics within it; and (2) location sensitivity: the label depends heavily on the locations and crops of the original image, which may change after data transformations such as random cropping. To remedy this, we propose RichSem, a simple but effective method that robustly learns rich semantics from coarse locations without requiring accurate bounding boxes. RichSem extracts rich semantics from images, which then serve as additional soft supervision for training detectors. Specifically, we add a semantic branch to our detector to learn these soft semantics and enhance feature representations for long-tailed object detection. The semantic branch is used only during training and is removed at inference. RichSem achieves consistent improvements in both overall and rare-category performance on LVIS under different backbones and detectors. Our method achieves state-of-the-art performance without requiring complex training and testing procedures. Moreover, additional experiments show the effectiveness of our method on other long-tailed datasets. Code is available at \url{https://github.com/MengLcool/RichSem}.
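The idea of a training-only semantic branch supervised by soft semantics can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the RichSem code: the function names (`softmax`, `soft_semantic_loss`, `training_loss`), the choice of a KL-divergence loss, the `semantic_weight` parameter, and the premise that some external source supplies a soft class distribution (`teacher_probs`) for a coarse region. The actual method may use a different loss and a different source of soft labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_semantic_loss(student_logits, teacher_probs, eps=1e-12):
    """KL(teacher || student): a hypothetical loss for the semantic branch.

    `teacher_probs` stands in for the soft semantics of a coarse region;
    `student_logits` stands in for the semantic branch's prediction.
    """
    student_probs = softmax(student_logits)
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def training_loss(detection_loss, student_logits, teacher_probs, semantic_weight=1.0):
    """Total training objective: detection loss plus the weighted soft-semantic term.

    At inference time the semantic branch is dropped, so only the detector's
    own outputs are used and this extra term plays no role.
    """
    return detection_loss + semantic_weight * soft_semantic_loss(
        student_logits, teacher_probs
    )


if __name__ == "__main__":
    # Hypothetical soft semantics over three classes for one coarse region.
    teacher = softmax([2.0, 0.5, -1.0])
    # Hypothetical semantic-branch prediction for the same region.
    branch_logits = [1.0, 1.0, 0.0]
    print(training_loss(0.8, branch_logits, teacher, semantic_weight=0.5))
```

Because the extra term only adds to the training objective, removing the branch at inference leaves the detector's cost unchanged, which matches the abstract's claim that no complex testing procedure is needed.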