In information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), which provides a unified task framework for analyzing any IE task over various modalities, along with its fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), Reamo, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. Reamo is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set that encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. An extensive comparison of Reamo with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research. Our resources are publicly released at https://haofei.vip/MUIE.