This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze is the task of selecting the correct text for a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, which achieves a 10% improvement over existing state-of-the-art models in both the easy and hard variants of the task. Central to our approach is a domain-adapted, ResNet-50-based visual encoder, fine-tuned to the comics domain in a self-supervised manner using SimCLR. This encoder delivers results comparable to those of more complex models with just one-fifth of the parameters. Additionally, we release new OCR annotations for this dataset, which improve the quality of the model input and yield a further 1% improvement. Finally, we extend the task to a generative format, establishing new baselines and expanding the research possibilities in the field of comics analysis.
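As an illustration of the SimCLR-style self-supervised objective mentioned above, the following minimal sketch computes the NT-Xent contrastive loss that SimCLR uses, in plain Python. It is not the training code of this work: the function name `nt_xent_loss`, the temperature value, and the toy embeddings are assumptions made purely for illustration, and in practice the embeddings would come from two augmented views of each comic panel passed through the visual encoder.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes v is non-zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    view_a, view_b: lists of N embedding vectors (lists of floats).
    temperature: softmax temperature (illustrative default, an assumption).
    """
    n = len(view_a)
    # Stack both views: indices 0..N-1 are view A, N..2N-1 are view B.
    z = [_normalize(v) for v in view_a] + [_normalize(v) for v in view_b]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        sims = [
            sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
        ]
        # Log-sum-exp over every other embedding (self-similarity excluded).
        others = [s for k, s in enumerate(sims) if k != i]
        m = max(others)
        log_denom = m + math.log(sum(math.exp(s - m) for s in others))
        total += log_denom - sims[pos]
    return total / (2 * n)


# Toy check: correctly paired views should give a lower loss than swapped ones.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss pulls the two views of the same panel together and pushes all other panels in the batch away, which is the mechanism by which an encoder can be adapted to a new visual domain without labels.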