Automatic speech recognition (ASR) has advanced remarkably for typical speech, but pathological speech caused by neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived speaker information into each transformer layer of a frozen ASR encoder to adapt its internal representations to individual pathological speakers without modifying the base model weights. We benchmark this approach on the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate whether the adapted model preserves its ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
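To make the conditioning mechanism concrete, the following is a minimal sketch of FiLM applied to the outputs of frozen encoder layers. It is an illustration under stated assumptions, not the authors' implementation: the names `FiLMGenerator`, `film`, and `conditioned_encoder` are hypothetical, the modulation is assumed to act on each layer's output, the identity initialisation (scale 1, shift 0) is an assumed design choice, and a 3-dimensional list stands in for a real x-vector.

```python
from dataclasses import dataclass, field


def affine(weight, bias, x):
    """Return weight @ x + bias for a list-of-rows matrix and a vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


@dataclass
class FiLMGenerator:
    """Hypothetical per-layer map from a speaker embedding to (gamma, beta)."""
    spk_dim: int
    hidden_dim: int
    w_gamma: list = field(init=False)
    b_gamma: list = field(init=False)
    w_beta: list = field(init=False)
    b_beta: list = field(init=False)

    def __post_init__(self):
        # Assumed identity initialisation: gamma = 1 and beta = 0 for any speaker,
        # so the conditioned encoder initially behaves like the frozen one.
        self.w_gamma = [[0.0] * self.spk_dim for _ in range(self.hidden_dim)]
        self.b_gamma = [1.0] * self.hidden_dim
        self.w_beta = [[0.0] * self.spk_dim for _ in range(self.hidden_dim)]
        self.b_beta = [0.0] * self.hidden_dim

    def __call__(self, spk_emb):
        gamma = affine(self.w_gamma, self.b_gamma, spk_emb)
        beta = affine(self.w_beta, self.b_beta, spk_emb)
        return gamma, beta


def film(frames, gamma, beta):
    """Apply h' = gamma * h + beta feature-wise to every frame."""
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)] for frame in frames]


def conditioned_encoder(frames, spk_emb, frozen_layers, generators):
    """Run the frozen layers, modulating each layer's output with FiLM."""
    for layer, gen in zip(frozen_layers, generators):
        frames = layer(frames)                # frozen: weights are never updated
        frames = film(frames, *gen(spk_emb))  # only the generators would be trained
    return frames


# Toy usage: two stand-in "layers" and a 3-dimensional stand-in for an x-vector.
def identity_layer(frames):
    return frames


gens = [FiLMGenerator(spk_dim=3, hidden_dim=2) for _ in range(2)]
frames = [[0.5, -1.0], [2.0, 0.25]]
x_vector = [0.1, -0.3, 0.7]
out = conditioned_encoder(frames, x_vector, [identity_layer, identity_layer], gens)
```

In this sketch only the `FiLMGenerator` parameters would be updated during adaptation, which is what keeps the base model weights untouched. With the assumed identity initialisation, the conditioned encoder starts out reproducing the frozen encoder's outputs.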