Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., subject-verb-object tuples) or vocabulary (e.g., relation types), which restricts the expressiveness of the extracted knowledge. In this work, we take a first step toward a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge and a visual knowledge generator that generates format-free knowledge by prompting a large multimodal model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the open visual knowledge extracted by OpenVik. Moreover, integrating our extracted knowledge into various visual reasoning applications yields consistent improvements, indicating the real-world applicability of OpenVik.