Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for the systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, which achieve competitive performance across multiple tasks and empirically validate the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source the proposed dataset, benchmark, and models to support future research on dialectal speech intelligence.