Detecting objects based on language information is a popular task that includes Open-Vocabulary Object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance both to a more practical setting called Described Object Detection (DOD): relative to OVD, category names are expanded to flexible language expressions, and relative to REC, we remove the limitation of grounding only objects presumed to exist in the image. We establish the research foundation for DOD by constructing a Description Detection Dataset ($D^3$). This dataset features flexible language expressions, ranging from short category names to long descriptions, and annotates all described objects on all images without omission. By evaluating previous SOTA methods on $D^3$, we identify several challenges on which current REC, OVD, and bi-functional methods fail. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods are constrained by long and complex descriptions. Recent bi-functional methods also do not work well on DOD because their training procedures and inference strategies for the REC and OVD tasks are separate. Building on these findings, we propose a baseline that substantially improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code are available at https://github.com/shikras/d-cube, and related works are tracked at https://github.com/Charles-Xie/awesome-described-object-detection.