Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Visual information helps resolve ambiguity in coreference resolution and yields notable performance gains. However, existing Multimodal Coreference Resolution (MCR) methods must be trained on (partially) annotated data from the target dataset before they can be applied, which prevents direct use and raises concerns about generalization. Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, but they remain largely inaccessible: their size limits deployability, and many are available only through paid APIs. In this paper, we propose a plug-and-adapt method that adapts a pre-trained \emph{alignment model} for immediate use in MCR tasks, designed to eliminate the need to train on scarce benchmark datasets or to rely on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model for MCR through similarity aggregation, fusing visual and categorical cues with evidence theory to improve effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset show that our method improves CoNLL F1 by 5.31\% and 2.12\% over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset to test robustness and on a specially constructed VCR-MCR dataset to assess generalization; the results confirm both capabilities.
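The abstract does not spell out how visual and categorical cues are fused, so the following is only a minimal sketch of one standard evidence-theory operation, Dempster's rule of combination, applied to a single mention pair. The two-hypothesis frame (coreferent / not coreferent), the `Mass` class, the `mass_from_score` mapping, and the `reliability` discount are all illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mass:
    """Belief mass over the frame {coref, not_coref} (hypothetical setup)."""
    coref: float      # mass committed to "the two mentions corefer"
    not_coref: float  # mass committed to "they do not corefer"
    unknown: float    # mass left on the whole frame (ignorance)


def mass_from_score(score: float, reliability: float) -> Mass:
    """Turn a similarity score in [0, 1] into a mass function.

    Assumption: `reliability` in [0, 1] discounts the cue; the remaining
    1 - reliability is assigned to ignorance.
    """
    if not (0.0 <= score <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("score and reliability must lie in [0, 1]")
    return Mass(reliability * score, reliability * (1.0 - score), 1.0 - reliability)


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule of combination for the two-hypothesis frame."""
    # Conflict: one source supports coref while the other supports not_coref.
    conflict = m1.coref * m2.not_coref + m1.not_coref * m2.coref
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    return Mass(
        (m1.coref * m2.coref + m1.coref * m2.unknown + m1.unknown * m2.coref) / norm,
        (m1.not_coref * m2.not_coref + m1.not_coref * m2.unknown
         + m1.unknown * m2.not_coref) / norm,
        (m1.unknown * m2.unknown) / norm,
    )


# Hypothetical scores for one mention pair: a visual cue and a categorical cue.
visual = mass_from_score(score=0.8, reliability=0.9)
categorical = mass_from_score(score=0.6, reliability=0.5)
fused = combine(visual, categorical)
```

In this sketch a cue with low reliability leaves most of its mass on ignorance and therefore shifts the fused belief only slightly, which is the usual motivation for preferring evidence theory over plain score averaging.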
