Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. From an embedding-learning standpoint, prior utterance-level matching methods learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: for each prefix size, a PCA-compressed version of the full text embedding serves as a teacher target to align both the audio and text prefixes. This alignment concentrates salient keyword cues in the lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference-time overhead.
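The PCA-guided prefix alignment can be sketched as follows. This is a minimal illustration only: the embedding dimension, prefix sizes, batch of random embeddings, and the MSE form of the alignment term are all assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                           # full embedding dimension (assumed for illustration)
prefix_sizes = [8, 16, 32, 64]   # nested matryoshka granularities (assumed)

# Stand-ins for dual-encoder outputs: full-dimensional text and audio embeddings
text_emb = rng.normal(size=(256, D))
audio_emb = rng.normal(size=(256, D))

# Fit PCA on the full text embeddings; rows of Vt are principal axes,
# ordered by explained variance
text_mean = text_emb.mean(axis=0)
_, _, Vt = np.linalg.svd(text_emb - text_mean, full_matrices=False)

def pca_teacher(m):
    """PCA-compressed version of the full text embedding: top-m components."""
    return (text_emb - text_mean) @ Vt[:m].T

def prefix_alignment_loss(emb, m):
    """Align the m-dim prefix of an embedding to its PCA teacher target (MSE assumed)."""
    return np.mean((emb[:, :m] - pca_teacher(m)) ** 2)

# One alignment term per prefix size, applied to both the audio and text prefixes
loss = sum(prefix_alignment_loss(e, m)
           for e in (text_emb, audio_emb)
           for m in prefix_sizes)
print(f"total prefix-alignment loss: {loss:.3f}")
```

Because the teacher is built from the leading principal components, the lowest-dimensional prefixes are pushed toward the directions of highest variance, which is what concentrates salient keyword cues there.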