The expanding market for e-comics has spurred interest in automated methods for analyzing comics. A deeper understanding of comics requires an automated way to link each piece of text to the character who speaks it. Comics speaker detection has practical applications, such as automatic character assignment for audiobooks, automatic translation that reflects characters' personalities, and inference of character relationships and stories. To address the shortage of speaker-to-text annotations, we created a new annotation dataset, Manga109Dialog, based on Manga109. Manga109Dialog is the world's largest comics speaker annotation dataset, containing 132,692 speaker-to-text pairs. We further divided the dataset into levels of prediction difficulty so that speaker detection methods can be evaluated more appropriately. Unlike existing methods, which are mainly based on distance, we propose a deep learning-based method that uses scene graph generation models. To account for the unique features of comics, we further improve the proposed model by considering the frame reading order. We conducted experiments on Manga109Dialog and other datasets. The results demonstrate that our scene-graph-based approach outperforms existing methods, achieving a prediction accuracy of over 75%.