The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e., the as-broadcast script) must be structured into a sequence of dialogue lines, each including time codes, a speaker name, and a transcript. Current speech recognition technology eases the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, and (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach that leverages production scripts used during the shooting process to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate an improvement of 51.7% relative to two unsupervised baseline models on our metrics on a 66-show test set.