Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks such as voice generation and translation. However, because characters vary by comic title, supervised learning approaches, such as training character classifiers that require annotations specific to each comic title, are infeasible. This motivates us to propose a novel zero-shot approach that allows machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have remained largely unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, but their application to multimodal content analysis remains an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.