Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code, and models are released to foster open science.