Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples organized into five subtasks under a Pick-a-Panel framework, ComicsPAP requires models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close this gap, we adapt LMMs for comic strip understanding, obtaining better results on ComicsPAP than models 10× larger, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.