Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can model the rich acoustic information present in audio signals beyond spoken content, such as emotion, background noise, etc. Despite this, evaluation benchmarks that assess awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model assigns correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, highlighting the strengths and weaknesses of each evaluated method. We make the code and data publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .
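The modelling-based approach described above can be sketched as a simple pairwise comparison: for each pair of a correct and a corrupted sample, the model under test assigns each a scalar score (e.g. a log-likelihood), and the benchmark metric is the fraction of pairs in which the correct sample scores higher. A minimal sketch, where `score_fn` is a hypothetical stand-in for a speech LM's scoring function:

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of (positive, negative) pairs where the positive
    (correct) sample receives a strictly higher score.

    pairs: iterable of (positive_sample, negative_sample) tuples.
    score_fn: maps a sample to a scalar score (higher = more likely).
    """
    correct = 0
    total = 0
    for pos, neg in pairs:
        # Count the pair as correct if the model prefers the positive sample.
        correct += score_fn(pos) > score_fn(neg)
        total += 1
    return correct / total if total else 0.0
```

Because each sample only needs one forward pass to produce a score, and no generation or decoding is involved, this kind of metric remains cheap to compute even for large models, as the abstract notes.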