Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g., deleted) or unreliable (e.g., anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. In real-world settings, however, one often has access only to more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system affect attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features that reveal speaker identity.