Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived by clustering the outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of the ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it outperforms the ASR system based on continuous SSL features in most conditions.
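To make the tokenization the abstract refers to concrete, the following is a minimal sketch of deriving discrete semantic tokens by k-means clustering of frame-level SSL features. It is illustrative only: real systems cluster outputs of a pretrained model such as HuBERT or WavLM with a much larger codebook, whereas here the features are random placeholders and `k=8` is an arbitrary choice.

```python
import numpy as np

def kmeans_tokens(features, k=8, iters=20, seed=0):
    """Quantize continuous frame features into discrete token ids via
    plain k-means. Stands in for the clustering step applied to SSL
    model outputs; the feature source and k are assumptions here."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen frames.
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign every frame to its nearest centroid (Euclidean).
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return labels, centroids

# Toy "utterance": 200 frames of 16-dim pseudo-SSL features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))
tokens, cents = kmeans_tokens(feats, k=8)
print(tokens[:10])
```

A noisy utterance passed through the same codebook would generally yield a different token sequence than its clean counterpart; the frontends compared in this work aim to recover the clean sequence before the ASR backend consumes it.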