Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper we fine-tune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also attenuate input signals by applying a series of filters, finding that low-pass filters at 3.2 kHz improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
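To make the filtering step concrete, the following is a minimal sketch of a 3.2 kHz low-pass filter applied to audio before it reaches the model. It is an illustration only, not the filter used in this work: the windowed-sinc FIR design, the Hamming window, the 101-tap length, the 16 kHz sample rate, and all function names are assumptions chosen for the example.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design windowed-sinc low-pass FIR taps (Hamming window).

    Assumed design for illustration; the source does not specify one.
    """
    fc = cutoff_hz / sample_rate_hz      # cutoff in cycles per sample
    mid = (num_taps - 1) / 2             # center of the symmetric filter
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (a sinc), with its limit at x == 0.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]      # normalize to unity gain at 0 Hz

def apply_fir(signal, taps):
    """Convolve signal with taps, keeping the input length (zero-padded edges)."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k             # centered, so the output is not delayed
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def tone(freq_hz, sample_rate_hz, num_samples):
    """Generate a unit-amplitude sine wave for testing."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(num_samples)]

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

SAMPLE_RATE = 16000                      # assumed input sample rate
taps = lowpass_fir(3200, SAMPLE_RATE)
low_out = apply_fir(tone(1000, SAMPLE_RATE, 800), taps)   # below the cutoff
high_out = apply_fir(tone(6000, SAMPLE_RATE, 800), taps)  # above the cutoff
```

In this sketch a 1 kHz tone passes through nearly unchanged while a 6 kHz tone is strongly attenuated, which is the behavior a 3.2 kHz low-pass stage is meant to have. In practice a vectorized filter from a signal-processing library would replace the pure-Python loops.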