Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) require reasoning over both intricate linguistic content and multimodal context. In this study, we explore distilling the reasoning ability of large language models (LLMs) into a more compact student model by generating a \textit{chain of thought} (CoT), a sequence of intermediate reasoning steps. Specifically, we first show how to elicit such reasoning from LLMs through CoT prompts covering multi-grain (noun, sentence, multimodality) and data-augmentation (style, entity, image) dimensions. We then present a novel conditional prompt distillation method that transfers the commonsense reasoning ability of LLMs to the student model, so that the student can handle text-only inputs without requiring additional image or CoT knowledge. Extensive experiments show that our approach attains state-of-the-art accuracy and offers advantages in interpretability, data efficiency, and cross-domain generalization on MNER and MRE datasets.