Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word level and the phonetic level. However, current research on dysfluency modeling focuses primarily on either transcription or detection, and performance on each task remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. By handling both tasks automatically, UDM eliminates the need for extensive manual annotation. Furthermore, we introduce a simulated dysfluent dataset, VCTK++, to enhance UDM's phonetic transcription capabilities. Our experimental results demonstrate the effectiveness and robustness of the proposed methods on both transcription and detection tasks.