Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples for each relationship to achieve satisfactory performance. Some recent work tackles this problem through few-shot learning with elaborately designed pipelines and pre-trained word vectors. However, the performance of existing few-shot VRD models is severely hampered by poor generalization, as they struggle to handle the vast semantic diversity of visual relationships. Humans, in contrast, can learn new relationships from just a few examples by drawing on their prior knowledge. Inspired by this, we devise a knowledge-augmented few-shot VRD framework that leverages both textual knowledge and visual relation knowledge to improve the generalization ability of few-shot VRD. The textual knowledge and visual relation knowledge are acquired from a pre-trained language model and an automatically constructed visual relation knowledge graph, respectively. We extensively validate the effectiveness of our framework. Experiments conducted on three benchmarks from the commonly used Visual Genome dataset show that our framework outperforms existing state-of-the-art models by a large margin.