How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic \textit{information} in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. Second, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation: data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. The mutual information peaks at around 3-6 Hz, which is also the range of modulations most preferred by the ASR system. In addition, we show that incorporating this knowledge into ASR systems significantly reduces their dependency on the amount of training data.
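The first perspective rests on estimating the mutual information between a continuous feature and discrete frame-wise labels. The following is a minimal sketch of one common way to do that, a histogram (plug-in) estimate, and is not necessarily the estimator used in this work. The function name `mutual_information_bits`, the bin count, and the synthetic data are all assumptions made for illustration; a real experiment would replace the synthetic feature with the energy of one temporal modulation band per frame and the synthetic labels with aligned phoneme labels.

```python
import math
import random
from collections import Counter


def mutual_information_bits(feature, labels, n_bins=8):
    """Histogram (plug-in) estimate of I(feature; labels) in bits.

    Hypothetical helper for illustration: `feature` is one scalar per
    frame, `labels` is one discrete class per frame.
    """
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    # Quantize the continuous feature into equal-width bins.
    bins = [min(int((x - lo) / width), n_bins - 1) for x in feature]
    n = len(bins)
    joint = Counter(zip(bins, labels))  # counts of (bin, label) pairs
    count_bin = Counter(bins)           # marginal counts per bin
    count_lab = Counter(labels)         # marginal counts per label
    mi = 0.0
    for (b, y), c in joint.items():
        # p(b, y) * log2( p(b, y) / (p(b) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_bin[b] * count_lab[y]))
    return mi


# Synthetic stand-in data (assumption): 4 classes, 5000 frames.
rng = random.Random(0)
labels = [rng.randrange(4) for _ in range(5000)]
informative = [y + rng.gauss(0.0, 0.3) for y in labels]   # depends on the label
uninformative = [rng.gauss(0.0, 1.0) for _ in labels]     # independent of the label

mi_informative = mutual_information_bits(informative, labels)
mi_uninformative = mutual_information_bits(uninformative, labels)
print(f"informative feature:   {mi_informative:.3f} bits")
print(f"uninformative feature: {mi_uninformative:.3f} bits")
```

Repeating such an estimate for each modulation frequency band would give a curve of information against modulation frequency, which is the kind of quantity the abstract reports as peaking around 3-6 Hz. Note that plug-in estimates are biased upward for small sample sizes, so the bin count should be kept modest relative to the number of frames.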