Unsupervised speech models are becoming ubiquitous in the speech and machine learning communities. Upstream models learn meaningful representations from raw audio, and these representations then serve as input to downstream models that solve tasks such as keyword spotting or emotion recognition. As edge speech applications start to emerge, it is important to gauge how robust these cross-task representations are on edge devices with limited resources and varying noise levels. To this end, we evaluate the robustness of four versions of HuBERT: the base, large, and extra-large versions, as well as a recent version termed Robust-HuBERT. Tests are conducted under different additive and convolutive noise conditions for three downstream tasks: keyword spotting, intent classification, and emotion recognition. Our results show that while larger models provide some robustness to environmental factors, they may not be applicable to edge applications. Smaller models, on the other hand, show substantial accuracy drops in noisy conditions, especially in the presence of room reverberation. These findings suggest that cross-task speech representations are not yet ready for edge applications and that further innovations are still needed.
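To make the two corruption types named in the abstract concrete, the following is a minimal sketch of how additive noise and convolutive (reverberant) distortion are commonly simulated. It is not the authors' evaluation pipeline: the function names, the synthetic exponentially decaying impulse response, and the sine-wave stand-in for speech are all assumptions chosen for illustration.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Additive corruption: scale `noise` so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # repeat or truncate noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / (gain^2 * noise_power) == 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def synthetic_rir(rt60_s, sample_rate, rng):
    """Toy room impulse response (an assumption, not a measured room):
    white noise whose envelope decays by 60 dB after `rt60_s` seconds."""
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    envelope = 10 ** (-3 * t / rt60_s)  # amplitude factor 1e-3 (-60 dB) at t = rt60_s
    rir = rng.standard_normal(n) * envelope
    return rir / np.max(np.abs(rir))


def add_reverb(speech, rir):
    """Convolutive corruption: convolve speech with the impulse response,
    then trim the tail so the output keeps the input length."""
    return np.convolve(speech, rir)[: len(speech)]


rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2 * np.pi * 220 * t)          # stand-in for a 1-second utterance
noise = rng.standard_normal(sample_rate // 2)  # shorter noise clip, looped by np.resize

noisy = add_noise_at_snr(speech, noise, snr_db=5.0)
reverberant = add_reverb(speech, synthetic_rir(0.5, sample_rate, rng))
```

In a robustness study of this kind, corrupted waveforms such as `noisy` and `reverberant` would be passed to the upstream model in place of the clean audio, and downstream accuracy compared across conditions.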