Few-shot relation extraction aims to predict the relation between named entities in a sentence when only a few labeled instances are available for training. Existing few-shot relation extraction methods rely on a single modality, typically text alone. Their performance drops when the text provides no clear context linking the named entities. We propose MFS-HVE, a multi-modal few-shot relation extraction model that leverages both textual and visual semantic information to jointly learn a multi-modal representation. MFS-HVE consists of semantic feature extractors and a multi-modal fusion unit. The semantic feature extractors extract textual and visual features, where the visual features comprise global image features and local object features within the image. The multi-modal fusion unit integrates information across modalities using image-guided attention, object-guided attention, and hybrid feature attention to capture the semantic interaction between visual regions of an image and the relevant text. Extensive experiments on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
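To make the idea of guided attention concrete, the following is a minimal sketch in which a visual feature vector acts as the query and reweights the token features of a sentence. It is not the MFS-HVE formulation, which the abstract does not spell out. It assumes plain scaled dot-product attention, and the function names (`softmax`, `guided_attention`) and all toy vectors are hypothetical. The hybrid feature attention step is left out because the abstract does not describe how it combines the features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(guide, token_vectors):
    """Pool token vectors, weighting each by its similarity to a guide vector.

    Assumption: scaled dot-product scoring. The abstract does not state
    which scoring function MFS-HVE uses.
    """
    dim = len(guide)
    scores = [
        sum(g * t for g, t in zip(guide, token)) / math.sqrt(dim)
        for token in token_vectors
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * token[i] for w, token in zip(weights, token_vectors))
        for i in range(dim)
    ]
    return weights, pooled


# Toy inputs (hypothetical): three token vectors, one global image
# feature, and two local object features, all 2-dimensional.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feature = [2.0, 0.0]
object_features = [[0.0, 2.0], [1.0, 1.0]]

# Image-guided attention: the global image feature is the query.
image_weights, image_guided = guided_attention(image_feature, tokens)

# Object-guided attention: one pooled text vector per detected object.
object_guided = [guided_attention(obj, tokens)[1] for obj in object_features]
```

In this toy setup the image feature points along the first dimension, so the first and third tokens receive equal weight and the second token receives less. A real model would use learned projections for queries, keys, and values rather than raw features.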