Personalizing dysarthric ASR is hindered by demanding enrollment collection and per-user training. We propose a hybrid meta-training method that yields a single model capable of zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). On Euphonia, it achieves a 13.9% Word Error Rate (WER), surpassing speaker-independent baselines (17.5%). On SAP Test-1, our 5.3% WER outperforms the challenge-winning team (5.97%); on Test-2, our 9.49% trails only the winner (8.11%), without relying on techniques such as offline model merging or custom audio chunking. Curating in-context examples by randomly sampling same-speaker utterances yields a 40% WER reduction, validating active personalization. While static text-based curation fails to beat this random baseline, oracle similarity selection reveals substantial headroom, highlighting dynamic acoustic retrieval as the next frontier. Data ablations confirm rapid adaptation to low-resource speakers, establishing the model as a practical personalized solution.
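To make the few-shot ICL setup concrete, the sketch below illustrates the random same-speaker curation baseline described above: sample k enrollment (audio, transcript) pairs from the target speaker and interleave them before the test utterance. This is a minimal illustration under stated assumptions, not the paper's implementation; the `model.generate` interface, the example dictionaries, and `personalize_transcribe` are hypothetical names.

```python
import random

def build_icl_context(enrollment_pool, speaker_id, k=5, seed=0):
    """Randomly sample k same-speaker (audio, transcript) pairs to serve as
    in-context examples -- the random-curation baseline in the abstract.
    `enrollment_pool` is assumed to be a list of dicts with keys
    'speaker', 'audio', and 'transcript' (hypothetical schema)."""
    same_speaker = [ex for ex in enrollment_pool if ex["speaker"] == speaker_id]
    rng = random.Random(seed)
    return rng.sample(same_speaker, min(k, len(same_speaker)))

def personalize_transcribe(model, target_audio, context):
    """Few-shot, on-the-fly personalization: prepend the curated
    (audio, transcript) pairs to the target utterance and decode once.
    `model.generate` stands in for a speech-LM interface that accepts
    interleaved audio/text segments (an assumption, not the paper's API)."""
    prompt = []
    for ex in context:
        prompt.append(("audio", ex["audio"]))
        prompt.append(("text", ex["transcript"]))
    prompt.append(("audio", target_audio))  # utterance to transcribe
    return model.generate(prompt)
```

With k=0 the same interface degenerates to the zero-shot case, which is what lets one model cover both regimes without per-user fine-tuning.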