Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation. To address both, we propose a novel framework that simultaneously performs internal-information screening and external-information exploitation. First, we represent the fine-grained semantic structures of the input image and text with visual and textual scene graphs, which are then fused into a unified cross-modal graph (CMG). Based on the CMG, we refine its structure under the guidance of the graph information bottleneck principle, actively filtering out less-informative features. Next, we perform topic modeling over the input image and text, incorporating latent multimodal topic features to enrich the context. On the benchmark MRE dataset, our system significantly outperforms the current best model. Further in-depth analyses reveal the great potential of our method for the MRE task. Our code is available at https://github.com/ChocoWu/MRE-ISE.
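As a rough illustration of the graph-fusion and screening steps summarized in the abstract, the following minimal sketch merges a textual and a visual scene graph into one cross-modal graph and then drops low-weight edges. Everything in it is an assumption made for illustration: the function names `fuse_graphs` and `prune_edges`, the toy feature vectors, the cosine-similarity rule for adding cross-modal edges, and the fixed thresholds are hypothetical and are not taken from the paper or its repository. In particular, threshold pruning is only a simple stand-in for the learned refinement guided by the graph information bottleneck principle.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_graphs(text_graph, visual_graph, link_threshold=0.8):
    """Merge two scene graphs and add cross-modal edges between similar nodes.

    Each graph is a dict of the form
    {"nodes": {name: feature_vector}, "edges": [(src, dst, weight)]}.
    Node names are prefixed with "t:" or "v:" so the modalities cannot collide.
    """
    nodes = {f"t:{n}": f for n, f in text_graph["nodes"].items()}
    nodes.update({f"v:{n}": f for n, f in visual_graph["nodes"].items()})
    edges = [(f"t:{s}", f"t:{d}", w) for s, d, w in text_graph["edges"]]
    edges += [(f"v:{s}", f"v:{d}", w) for s, d, w in visual_graph["edges"]]
    # Hypothetical linking rule: connect a text node and a visual node
    # when their feature vectors are similar enough.
    for t_name, t_feat in text_graph["nodes"].items():
        for v_name, v_feat in visual_graph["nodes"].items():
            sim = cosine(t_feat, v_feat)
            if sim >= link_threshold:
                edges.append((f"t:{t_name}", f"v:{v_name}", sim))
    return {"nodes": nodes, "edges": edges}


def prune_edges(graph, keep_threshold=0.5):
    """Drop edges below keep_threshold, then drop nodes left without edges."""
    kept = [e for e in graph["edges"] if e[2] >= keep_threshold]
    connected = {n for s, d, _ in kept for n in (s, d)}
    nodes = {n: f for n, f in graph["nodes"].items() if n in connected}
    return {"nodes": nodes, "edges": kept}


# Toy inputs (made up for this sketch, not real scene-graph parser output).
text_graph = {
    "nodes": {"player": [1.0, 0.0], "team": [0.0, 1.0]},
    "edges": [("player", "team", 0.9)],
}
visual_graph = {
    "nodes": {"person": [0.9, 0.1], "grass": [0.5, 0.5]},
    "edges": [("person", "grass", 0.2)],
}

cmg = fuse_graphs(text_graph, visual_graph)
refined = prune_edges(cmg)
```

In this toy case the fused graph links the textual node `player` to the visual node `person`, and pruning removes the weak `person`-`grass` edge together with the now-isolated `grass` node. A learned refinement would score nodes and edges with respect to the relation label rather than use a fixed cutoff.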