In recent years, the state of the art in transcription of disfluent and conversational speech has favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality of audio and the limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large-scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end models and find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models. We further find that the choice of pretraining objective substantially affects a model's ability to be adapted to the disfluency removal task.