Everyday images may convey abstract meanings that require us to recall prior knowledge and infer deeper information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks such as traditional segmentation or classification. Inspired by Horn's QR theory, we design a novel QR-CLIP model consisting of two components: 1) the Quantity module first retrieves additional open-world knowledge as candidate language inputs; 2) the Relevance module carefully weighs vision and language cues and infers the location and time. Experiments demonstrate the effectiveness of QR-CLIP: it outperforms the previous SOTA on each task, with average relative lifts of about 10% for location reasoning and 130% for time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is one of the keys to these tasks.
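The two-component design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`quantity_step`, `relevance_step`, `predict`), the three-dimensional toy vectors standing in for image and text embeddings, and the softmax weighting are all assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def quantity_step(image_vec, knowledge, k):
    """Hypothetical Quantity step: keep the k knowledge entries closest to the image."""
    ranked = sorted(knowledge, key=lambda item: cosine(image_vec, item[1]), reverse=True)
    return ranked[:k]

def relevance_step(image_vec, retrieved):
    """Hypothetical Relevance step: weight each retrieved entry by a softmax over
    its similarity to the image, then add the weighted entries to the image vector."""
    sims = [cosine(image_vec, vec) for _, vec in retrieved]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    fused = list(image_vec)
    for e, (_, vec) in zip(exps, retrieved):
        weight = e / total
        fused = [f + weight * v for f, v in zip(fused, vec)]
    return fused

def predict(fused, candidates):
    """Pick the candidate label whose vector is most similar to the fused vector."""
    return max(candidates, key=lambda name: cosine(fused, candidates[name]))

# Toy data (assumed): stand-ins for embeddings of an image, knowledge texts, and labels.
image_vec = [0.9, 0.1, 0.2]
knowledge = [
    ("text about the Eiffel Tower", [1.0, 0.0, 0.1]),
    ("text about the Sydney Opera House", [0.0, 1.0, 0.1]),
    ("text about a 1990s car model", [0.2, 0.1, 1.0]),
]
candidates = {"Paris": [1.0, 0.0, 0.0], "Sydney": [0.0, 1.0, 0.0]}

retrieved = quantity_step(image_vec, knowledge, k=2)
prediction = predict(relevance_step(image_vec, retrieved), candidates)
print(prediction)
```

In this toy setup the retrieval step narrows the knowledge pool before any weighting happens, which mirrors the order of the two components in the text: gather candidate language inputs first, then judge how much each one should count.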