Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) require fundamental reasoning ability to understand complex linguistic and multimodal content. In this study, we explore distilling the reasoning ability of large language models (LLMs) into a more compact student model by generating a \textit{chain of thought} (CoT), i.e., a sequence of intermediate reasoning steps. Specifically, we first show how to elicit this reasoning ability from LLMs through CoT prompts that cover multi-grain (noun, sentence, multimodality) and data-augmentation (style, entity, image) dimensions. We then present a novel conditional prompt distillation method that transfers the commonsense reasoning ability of LLMs to the student model, so that the student can handle text-only inputs without requiring additional image or CoT knowledge. Extensive experiments show that our approach attains state-of-the-art accuracy on MNER and MRE datasets and offers advantages in interpretability, data efficiency, and cross-domain generalization.