Manga is a popular medium that combines stylized drawings and text to convey stories. Because manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) has opened possibilities for more general approaches. To analyze the current capability of LMMs for manga understanding tasks and to identify areas for their improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as content conveyed across multiple panels, allowing for a fine-grained analysis of the various capabilities a model requires for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels remains challenging, highlighting future work towards LMMs for manga understanding.