In speech evaluation, an Automatic Speech Recognition (ASR) model often computes the time boundaries and phoneme posteriors that serve as input features. However, the limited data available for ASR training hinders the expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models can perform ASR across many languages, but they are frame-asynchronous and not phonemic, which hinders feature extraction for speech evaluation. This paper proposes ways to overcome these incompatibilities when extracting features with weakly-supervised models, easing the expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Speaking rate and duration are computed at the word level rather than the phoneme level. Phoneme-level and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. The resulting features perform comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
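The idea of deriving phoneme posteriors from ASR hypotheses can be illustrated with a minimal sketch. The abstract does not specify how the confusion network is built, so the code below is an assumption-laden simplification rather than the paper's method: each N-best word hypothesis is expanded to phonemes with a toy lexicon, aligned to the top hypothesis by edit distance, and its normalized score is added as a vote in each phoneme slot. The names `TOY_LEXICON`, `words_to_phonemes`, `align_to_pivot`, and `phoneme_posteriors` are hypothetical, as are the example hypotheses and scores.

```python
import math
from collections import defaultdict

EPS = "<eps>"  # marks a pivot phoneme that a hypothesis skipped

# Hypothetical toy lexicon; a real system would use a pronunciation dictionary or G2P.
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "a": ["AH"],
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
}


def words_to_phonemes(words, lexicon=TOY_LEXICON):
    """Expand a word sequence into a flat phoneme sequence."""
    return [ph for word in words for ph in lexicon[word]]


def align_to_pivot(pivot, hyp):
    """Edit-distance alignment of hyp to pivot.

    Returns one entry per pivot phoneme: the aligned hyp phoneme, or EPS
    if hyp has nothing there. Phonemes inserted by hyp are dropped
    (a simplification compared with a full confusion network).
    """
    n, m = len(pivot), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (pivot[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    slots = [EPS] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (pivot[i - 1] != hyp[j - 1]):
            slots[i - 1] = hyp[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # pivot phoneme missing in hyp, slot stays EPS
        else:
            j -= 1  # extra phoneme in hyp, dropped
    return slots


def phoneme_posteriors(nbest):
    """nbest: list of (word_list, log_score), best hypothesis first.

    Returns one dict per pivot phoneme slot, mapping phoneme -> posterior.
    """
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)

    pivot = words_to_phonemes(nbest[0][0])
    slots = [defaultdict(float) for _ in pivot]
    for (words, _), weight in zip(nbest, weights):
        aligned = align_to_pivot(pivot, words_to_phonemes(words))
        for slot, phoneme in zip(slots, aligned):
            slot[phoneme] += weight / total
    return [dict(slot) for slot in slots]


if __name__ == "__main__":
    # Made-up hypotheses and scores, for illustration only.
    nbest = [
        (["the", "cat"], math.log(0.6)),
        (["the", "cap"], math.log(0.3)),
        (["a", "cat"], math.log(0.1)),
    ]
    for slot in phoneme_posteriors(nbest):
        print(slot)
```

In this toy example the final slot splits its mass between `T` and `P`, which is the kind of per-phoneme uncertainty a pronunciation scorer could consume without any frame-level alignment. Aligning everything to a single pivot keeps the sketch short; a fuller confusion network would also retain insertions.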