Multimodal information extraction (MIE) comprises a set of essential tasks that extract structured information from Web text by integrating accompanying images, thereby supporting the construction of Web-based semantic knowledge. To cope with the expanding category set on websites, which includes newly emerging entity types and relations, prior research proposed the zero-shot MIE (ZS-MIE) task, which aims to extract structured knowledge of unseen categories from textual and visual modalities. However, ZS-MIE models can only recognize samples that fall within the unseen category set, and they struggle with real-world scenarios that encompass both seen and unseen categories. The shortcomings of existing methods stem from two main aspects. On one hand, these methods construct representations of samples and categories in Euclidean space, failing to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. On the other hand, there is a notable gap between the semantic similarity distributions of the seen and unseen category sets, which impairs the generative capability of ZS-MIE models. To overcome these disadvantages, we investigate the generalized zero-shot MIE (GZS-MIE) task and propose a hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed in hyperbolic space to model the multi-level hierarchical semantic correlations among samples and prototypes. Furthermore, the model is trained on unseen samples generated by the decoder, and we introduce a semantic similarity distribution alignment loss to enhance generalization. Experimental evaluations on two benchmark datasets demonstrate the superiority of HMGRL over existing baseline methods.
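To make the two core ingredients of the abstract concrete, the following is a minimal sketch, not the authors' implementation: it (i) embeds multimodal sample features and category prototypes in the Poincaré ball model of hyperbolic space, where geodesic distance reflects hierarchy-aware semantic closeness, and (ii) illustrates one plausible reading of a semantic similarity distribution alignment loss. The fixed curvature of -1, all function names, and the moment-matching form of the loss are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed PyTorch); all names and the loss form are hypothetical.
import torch


def expmap0(v: torch.Tensor) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball (curvature -1):
    lifts Euclidean feature vectors into hyperbolic space."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(norm) * v / norm  # output norm < 1, i.e., inside the ball


def poincare_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Geodesic distance in the Poincare ball, used here as the closeness
    measure between sample embeddings and category-prototype embeddings."""
    diff2 = (x - y).pow(2).sum(dim=-1)
    denom = ((1 - x.pow(2).sum(dim=-1)) * (1 - y.pow(2).sum(dim=-1))).clamp_min(1e-8)
    return torch.acosh((1 + 2 * diff2 / denom).clamp_min(1.0 + 1e-7))


def similarity_alignment_loss(sims_seen: torch.Tensor,
                              sims_unseen: torch.Tensor) -> torch.Tensor:
    """One plausible proxy for the alignment loss: shrink the gap between the
    similarity distributions over the seen and unseen category sets by
    matching their first two moments (an assumption, not the paper's form)."""
    return ((sims_seen.mean() - sims_unseen.mean()).pow(2)
            + (sims_seen.std() - sims_unseen.std()).pow(2))


# Toy usage: fused multimodal sample features vs. seen/unseen prototypes.
# Inputs are scaled down so embeddings stay away from the ball boundary.
samples = expmap0(0.1 * torch.randn(8, 64))        # 8 sample embeddings
seen_protos = expmap0(0.1 * torch.randn(5, 64))    # 5 seen-category prototypes
unseen_protos = expmap0(0.1 * torch.randn(3, 64))  # 3 unseen-category prototypes

# Negative geodesic distance as semantic similarity (higher = closer).
sims_seen = -poincare_distance(samples[:, None, :], seen_protos[None, :, :])
sims_unseen = -poincare_distance(samples[:, None, :], unseen_protos[None, :, :])
loss = similarity_alignment_loss(sims_seen, sims_unseen)
```

The choice of the Poincaré ball is one common realization of hyperbolic representation learning; tree-like category hierarchies embed with low distortion there because volume grows exponentially with radius, which is what motivates moving the variational components out of Euclidean space.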