Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, applying them to long audio via buffered or sliding-window transcription is prone to drifting, hallucination, and repetition, and the sequential nature of these approaches prevents batched transcription. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out of the box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps that utilises voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.
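To make the pre-segmentation idea concrete, the following is a minimal sketch of a cut-and-merge step over voice activity detection output, not the system's actual implementation. It assumes the VAD output is a list of (start, end) speech segments in seconds, that the maximum chunk length is 30 seconds (chosen here to match Whisper's input window), and that over-long segments are cut into equal pieces. The abstract does not state the cut rule, so the uniform split is a placeholder, and the name `cut_and_merge` is hypothetical.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut over-long speech segments, then merge neighbours up to max_len.

    Assumption: uniform cuts stand in for whatever cut rule the real
    system uses; only the overall cut-then-merge shape is illustrated.
    """
    # Cut: split any segment longer than max_len into equal pieces.
    pieces: List[Segment] = []
    for start, end in sorted(segments):
        n = max(1, math.ceil((end - start) / max_len))
        step = (end - start) / n
        for i in range(n):
            pieces.append((start + i * step, start + (i + 1) * step))

    # Merge: greedily extend the current chunk while its total span,
    # including any silence between pieces, stays within max_len.
    merged: List[Segment] = []
    for start, end in pieces:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    vad_segments = [(0.0, 5.0), (6.0, 12.0), (40.0, 45.0)]
    for chunk in cut_and_merge(vad_segments):
        print(chunk)
```

Because every resulting chunk is bounded by the same maximum length and no chunk depends on the transcription of an earlier one, the chunks could be padded and transcribed together in a batch, which is the property the abstract credits for the speedup.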