We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods that rely on text prompts capture the abstract concept of common objects well, but struggle to represent rare or complex objects because of data scarcity and the limits of textual description. Conversely, visual prompts excel at depicting novel objects through concrete visual examples, but convey the abstract concept of an object less effectively than text prompts. Recognizing the complementary strengths and weaknesses of text and visual prompts, we introduce T-Rex2, which combines both prompt types within a single model through contrastive learning. T-Rex2 accepts text prompts, visual prompts, or a combination of both, so it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide range of scenarios. We show that text prompts and visual prompts benefit from each other within this unified model, which is essential for covering the vast and complicated range of real-world scenarios and paves the way towards generic object detection. The model API is now available at \url{https://github.com/IDEA-Research/T-Rex}.
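To make the idea of aligning text and visual prompts through contrastive learning more concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss between per-category text embeddings and visual prompt embeddings. The abstract does not specify the loss, the encoders, or the temperature, so everything here is an assumption for illustration: the function names (`info_nce`, `symmetric_contrastive_loss`), the temperature value of 0.07, and the toy 2-D embeddings are hypothetical and do not describe the actual T-Rex2 training objective.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] among all positives.

    Generic InfoNCE, assumed for illustration; not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """Average the text-to-visual and visual-to-text InfoNCE terms."""
    return 0.5 * (
        info_nce(text_emb, visual_emb, temperature)
        + info_nce(visual_emb, text_emb, temperature)
    )


# Toy example: two categories, one text and one visual prompt embedding each.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[0.9, 0.1], [0.1, 0.9]]   # visual prompt i matches text i
swapped_visual = [[0.1, 0.9], [0.9, 0.1]]   # visual prompt i matches the other text

aligned_loss = symmetric_contrastive_loss(text_emb, aligned_visual)
swapped_loss = symmetric_contrastive_loss(text_emb, swapped_visual)
```

In this sketch the loss is small when each visual prompt embedding lies close to the text embedding of the same category (`aligned_loss`) and large when the pairing is swapped (`swapped_loss`), which is the behaviour a contrastive alignment between the two prompt modalities would rely on.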