Everyday images may convey abstract meanings that require us to recall prior knowledge and infer deeper information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than to perform basic tasks such as traditional segmentation or classification. Inspired by Horn's QR theory, we design a novel QR-CLIP model consisting of two components: 1) the Quantity module first retrieves additional open-world knowledge to serve as candidate language inputs; 2) the Relevance module then weighs the vision and language cues and infers the location and time. Experiments demonstrate the effectiveness of QR-CLIP: it outperforms the previous state of the art (SOTA) on both tasks, with average relative lifts of about 10% on location reasoning and about 130% on time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is one of the keys to solving these tasks.
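The two-stage flow described in the abstract (gather candidate knowledge, then weigh it against the image) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantity_step`, `relevance_step`, and `predict`, the three-dimensional toy vectors, the softmax weighting, and the additive fusion are all assumptions made for illustration. In a real system the vectors would presumably come from CLIP-style image and text encoders rather than being written by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quantity_step(image_vec, knowledge, k):
    """Hypothetical Quantity stage: keep the k knowledge snippets closest to the image."""
    ranked = sorted(knowledge, key=lambda item: cosine(image_vec, item[1]), reverse=True)
    return ranked[:k]


def relevance_step(image_vec, candidates):
    """Hypothetical Relevance stage: softmax-weight candidates and fuse them with the image."""
    scores = [cosine(image_vec, vec) for _, vec in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = list(image_vec)
    for w, (_, vec) in zip(weights, candidates):
        fused = [f + w * v for f, v in zip(fused, vec)]
    return fused, weights


def predict(fused, label_vecs):
    """Return the label whose embedding is most similar to the fused representation."""
    return max(label_vecs, key=lambda name: cosine(fused, label_vecs[name]))


# Toy data: hand-written vectors standing in for encoder outputs (assumption).
image_vec = [0.9, 0.1, 0.2]
knowledge = [
    ("Eiffel Tower is in Paris", [1.0, 0.0, 0.1]),
    ("Big Ben is in London", [0.0, 1.0, 0.1]),
    ("Seine river cruise", [0.8, 0.2, 0.3]),
]
label_vecs = {"Paris": [1.0, 0.0, 0.0], "London": [0.0, 1.0, 0.0]}

candidates = quantity_step(image_vec, knowledge, k=2)
fused, weights = relevance_step(image_vec, candidates)
prediction = predict(fused, label_vecs)
print([text for text, _ in candidates], prediction)
```

The same `predict` call could rank time labels (for example, decades) instead of locations by swapping in a different `label_vecs` dictionary; how QR-CLIP itself handles the two tasks is not specified in the abstract.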