We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at https://github.com/nyrahealth/CrisperWhisper.
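The alignment step described above, dynamic time warping over the decoder's cross-attention scores, can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the attention matrix, the 20 ms frame duration, and the `dtw_path` helper are assumptions made purely for demonstration.

```python
import numpy as np

def dtw_path(cost):
    """Classic dynamic time warping over a (tokens x frames) cost matrix.

    Returns the monotonic alignment path as a list of (token, frame) pairs.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy cross-attention matrix: rows are decoded tokens, columns are audio frames.
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.2, 0.9],
])
# DTW minimises cost, so negate the attention (similarity) scores.
path = dtw_path(-attention)

# A token's timestamp span = first/last frame the path aligns it to.
frame_dur = 0.02  # hypothetical frame duration in seconds
for tok in range(attention.shape[0]):
    frames = [f for t, f in path if t == tok]
    print(f"token {tok}: {frames[0] * frame_dur:.2f}s - {(frames[-1] + 1) * frame_dur:.2f}s")
```

The path is constrained to be monotonic in both tokens and time, which is what makes the recovered word boundaries consistent with the decoding order; sharper, less diffuse attention (as encouraged by the tokenizer adjustment) concentrates the cost minimum and hence tightens the timestamps.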