Over the past few decades, Japanese comics, commonly referred to as manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues with their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.