Everyday images can convey abstract meanings that require us to recall knowledge from memory and infer deeper information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks such as traditional segmentation or classification. Inspired by Horn's QR theory, we design a novel QR-CLIP model consisting of two components: 1) the Quantity module first recalls more open-world knowledge as candidate language inputs; 2) the Relevance module carefully weighs vision and language cues and infers the location and time. Experiments demonstrate the effectiveness of QR-CLIP: it outperforms the previous SOTA on each task, with average relative lifts of about 10% on location reasoning and 130% on time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is one of the keys to these tasks.