Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented, and existing approaches fail to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models rely on highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL for speech is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective and introduce a new taxonomy that organizes CL methods according to how the underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and the behavior of speech foundation models, and outline a set of open challenges and future research directions.