In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in the deep recognition and analysis of cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework for analyzing any IE task over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), Reamo, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. Reamo is trained via various tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set that covers IE tasks across 9 common modality combinations with the corresponding multimodal groundings. Extensive comparisons of Reamo with existing MLLMs integrated into pipeline approaches demonstrate its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research. Our resources are publicly released at https://haofei.vip/MUIE.