Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open-world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate hypotheses regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also achieves convincing results on the novel task of weakly supervised, open-world, purely visual phrase grounding introduced in this work. For example, on the datasets used for benchmarking phrase grounding, our method suffers only a modest degradation relative to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking.
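The following is a minimal, hypothetical sketch of the caption-then-score idea described above, written against the off-the-shelf Hugging Face CLIP and BLIP checkpoints. It is not the authors' actual pipeline: the real method trains a dedicated localization network on COCO with CLIP-based supervision, whereas here a naive grid of candidate crops (the `grid_boxes` helper) stands in for that learned component, and `what_is_where` is an illustrative name, not a function from the released code.

```python
# Hypothetical sketch: BLIP proposes a phrase for each region, CLIP scores
# how well the phrase matches that region. The grid of boxes is a crude
# stand-in for the learned localization mechanism described in the abstract.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    CLIPProcessor, CLIPModel,
)

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")


def grid_boxes(width, height, steps=3):
    """Coarse grid of candidate boxes (placeholder for learned localization)."""
    boxes = []
    for i in range(steps):
        for j in range(steps):
            x0, y0 = i * width // steps, j * height // steps
            boxes.append((x0, y0, x0 + width // steps, y0 + height // steps))
    return boxes


def what_is_where(image: Image.Image, top_k: int = 5):
    """Return (box, phrase, score) triples for the most confident regions."""
    results = []
    for box in grid_boxes(*image.size):
        crop = image.crop(box)
        # BLIP generates a caption hypothesis for the region.
        blip_inputs = blip_processor(images=crop, return_tensors="pt")
        caption_ids = blip_model.generate(**blip_inputs, max_new_tokens=20)
        phrase = blip_processor.decode(caption_ids[0], skip_special_tokens=True)
        # CLIP scores the agreement between the region and the generated phrase.
        clip_inputs = clip_processor(text=[phrase], images=crop,
                                     return_tensors="pt", padding=True)
        with torch.no_grad():
            score = clip_model(**clip_inputs).logits_per_image.item()
        results.append((box, phrase, score))
    return sorted(results, key=lambda r: -r[2])[:top_k]
```

Under these assumptions, calling `what_is_where(Image.open("example.jpg"))` returns the highest-scoring box/phrase pairs; the actual method replaces the grid with a trained localization network and uses CLIP during training rather than only at inference.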