M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

Multimodal information extraction (IE) has attracted increasing attention because many studies have shown that multimodal information benefits textual information extraction. However, existing multimodal IE datasets mainly focus on sentence-level, image-facilitated IE in English text and pay little attention to video-based multimodal IE and fine-grained visual grounding. To promote the development of multimodal IE, we construct a multimodal, multilingual, multitask dataset named M$^{3}$D, which has the following features: (1) it contains paired document-level text and video to enrich multimodal information; (2) it supports two widely used languages, English and Chinese; (3) it covers a broader set of multimodal IE tasks, including entity recognition, entity chain extraction, relation extraction, and visual grounding. In addition, our dataset introduces a previously unexplored theme, biography, which broadens the domains covered by multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model that leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, because modal information is often incomplete in non-ideal scenarios, we design a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieves average scores of 53.80% and 53.77% across the four tasks on the English and Chinese datasets, respectively, which sets a reasonable baseline for subsequent research. We also conduct further analytical experiments to verify the effectiveness of the proposed modules. We believe that our work can promote the development of multimodal IE.
