This paper proposes a novel speaker adaptation approach for elderly speech recognition based on cross-utterance audio-textual prompts. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current utterance and a few preceding utterances, then fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model, with statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
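The absolute and relative reductions quoted above are linked by the baseline error rate: relative reduction = absolute reduction / baseline. The following minimal sketch back-computes the baseline that each pair of figures implies. The function name `implied_baseline` is a hypothetical helper introduced here for illustration, and the resulting values are approximations derived from the rounded figures in the abstract, not error rates reported by the authors.

```python
def implied_baseline(abs_reduction: float, rel_reduction: float) -> float:
    """Return the baseline error rate (in %) implied by an absolute
    reduction (in percentage points) and a relative reduction (in %).

    Hypothetical helper: relative = absolute / baseline, so
    baseline = absolute / (relative / 100).
    """
    return abs_reduction / (rel_reduction / 100.0)


# (absolute, relative) reduction pairs, in the order quoted in the abstract.
reductions = [(0.61, 2.99), (1.22, 4.48)]

for abs_red, rel_red in reductions:
    baseline = implied_baseline(abs_red, rel_red)
    adapted = baseline - abs_red  # error rate after adaptation
    # Approximate values only, since the inputs are rounded.
    print(f"implied SI baseline ~ {baseline:.1f}%, adapted ~ {adapted:.1f}%")
```

Under these assumptions, the first pair implies a speaker-independent baseline of roughly 20.4% and the second roughly 27.2%, which shows why a larger absolute gain can still correspond to a modest relative gain when the baseline is high.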