Detecting objects based on language descriptions is a popular task that includes Open-Vocabulary Object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance both to a more practical setting called Described Object Detection (DOD) by expanding the category names of OVD into flexible language expressions and by overcoming REC's limitation of grounding only pre-existing objects. We establish the research foundation for DOD by constructing a Description Detection Dataset ($D^3$), which features flexible language expressions and annotates all described objects without omission. By evaluating previous SOTA methods on $D^3$, we identify several problems that cause current REC, OVD, and bi-functional methods to fail. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods are constrained by long and complex descriptions. Recent bi-functional methods also perform poorly on DOD because they use separate training procedures and inference strategies for the REC and OVD tasks. Building on these findings, we propose a baseline that substantially improves REC methods by reconstructing the training data and introducing a binary classification sub-task, and that outperforms existing methods. Data and code are available at https://github.com/shikras/d-cube.