Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches such as training character classifiers, which require annotations specific to each comic title, are infeasible. This motivates us to propose a novel zero-shot approach that allows machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have remained largely unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, but their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.