This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Focusing on the visual and linguistic modalities, we categorize data knowledge into five unit types, namely in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline for constructing a multimodal knowledge graph from any data collection. Thanks to the logical information naturally contained in a knowledge graph, organizing datasets in the UKnow format opens up more possibilities for data usage than the commonly used image-text pairs. Following the UKnow protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (571,791 of them vision-related) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 11 coarse labels and 9,185 fine labels. Experiments on four benchmarks demonstrate the potential of UKnow in supporting common-sense reasoning and boosting vision-language pre-training with a single dataset, benefiting from its unified form of knowledge organization. Code, dataset, and models will be made publicly available.
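To make the five unit types concrete, the following is a minimal sketch of how triplets in a multimodal knowledge graph might be tagged by unit type. It is an illustration under stated assumptions, not the UKnow implementation: the class names (`UnitType`, `Node`, `Triplet`), the relation strings, the `modality` field, and the helper functions are all hypothetical and do not come from the released code or dataset.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class UnitType(Enum):
    """The five knowledge unit types named in the abstract."""
    IN_IMAGE = "in-image"
    IN_TEXT = "in-text"
    CROSS_IMAGE = "cross-image"
    CROSS_TEXT = "cross-text"
    IMAGE_TEXT = "image-text"


@dataclass(frozen=True)
class Node:
    node_id: str
    modality: str  # hypothetical field: "image" (vision-related) or "text"


@dataclass(frozen=True)
class Triplet:
    head: Node
    relation: str  # hypothetical relation name, for illustration only
    tail: Node
    unit: UnitType


def collect_nodes(triplets):
    """Return the set of distinct nodes appearing in any triplet."""
    nodes = set()
    for t in triplets:
        nodes.add(t.head)
        nodes.add(t.tail)
    return nodes


def count_by_unit(triplets):
    """Count how many triplets fall under each unit type."""
    return Counter(t.unit for t in triplets)


# Toy graph with one triplet per unit type (all names are made up).
img_a, img_b = Node("img_a", "image"), Node("img_b", "image")
obj = Node("img_a/obj_1", "image")
txt_a, txt_b = Node("txt_a", "text"), Node("txt_b", "text")
ent = Node("txt_a/ent_1", "text")

triplets = [
    Triplet(img_a, "contains_object", obj, UnitType.IN_IMAGE),
    Triplet(txt_a, "mentions_entity", ent, UnitType.IN_TEXT),
    Triplet(img_a, "similar_to", img_b, UnitType.CROSS_IMAGE),
    Triplet(txt_a, "same_topic_as", txt_b, UnitType.CROSS_TEXT),
    Triplet(img_a, "described_by", txt_a, UnitType.IMAGE_TEXT),
]

nodes = collect_nodes(triplets)
vision_nodes = [n for n in nodes if n.modality == "image"]
unit_counts = count_by_unit(triplets)
```

In this sketch a plain image-text pair corresponds to a single `IMAGE_TEXT` triplet, whereas the graph form also keeps the in-modality and cross-item relations alongside it, which is the extra structure the abstract refers to.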