Multimodal manga analysis aims to improve manga understanding by combining visual and textual features, and has attracted considerable attention from both the natural language processing and computer vision communities. Most comics are hand-drawn and are prone to problems such as missing pages, text contamination, and aging, which leave the text of a comic incomplete and seriously hinder human comprehension. However, the Multimodal Manga Complement (M2C) task, which aims to handle these issues by providing a shared semantic space for vision and language understanding, has not yet been investigated. To this end, we first propose the Multimodal Manga Complement task and establish a new M2C benchmark dataset covering two languages. Next, we design a manga argumentation method called MCoT to mine event knowledge in comics with large language models. Finally, we propose an effective baseline, FVP-M$^{2}$, which uses fine-grained visual prompts to support manga complement. Extensive experimental results show the effectiveness of the FVP-M$^{2}$ method for Multimodal Manga Complement.