Short-utterance speaker verification remains challenging because short speech segments carry limited speaker-discriminative cues. While existing methods focus on strengthening the speaker encoder, the embedding learning strategy still produces a single fixed-dimensional representation that is reused for utterances of any length, leaving model capacity misaligned with the information actually available at each duration. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional prefixes capture compact speaker traits from short utterances, while higher dimensions encode the richer detail present in longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-second and other short-duration trials while maintaining full-length performance at no additional inference cost. These gains generalize across diverse speaker encoder architectures under both general training and fine-tuning setups.
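To make the nested-hierarchy idea concrete, below is a minimal PyTorch sketch of a Matryoshka-style, duration-aware training loss. The prefix sizes (`NESTED_DIMS`), crop durations (`CROP_SECONDS`), the encoder interface, and the plain cross-entropy objective are all illustrative assumptions; the abstract does not specify DAME's actual dimension schedule or loss function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative constants -- the actual prefix sizes and duration schedule
# used by DAME are not given in the abstract and are assumed here.
NESTED_DIMS = [64, 128, 256]     # nested prefix sizes of a 256-d embedding
CROP_SECONDS = [1.0, 2.0, 6.0]   # crop duration paired with each prefix
SAMPLE_RATE = 16000


class MatryoshkaHead(nn.Module):
    """One linear classifier per nested prefix of the full embedding."""

    def __init__(self, num_speakers: int, dims=tuple(NESTED_DIMS)):
        super().__init__()
        self.dims = dims
        self.classifiers = nn.ModuleList(
            nn.Linear(d, num_speakers) for d in dims
        )


def dame_style_loss(encoder: nn.Module,
                    head: MatryoshkaHead,
                    waveform: torch.Tensor,   # (batch, samples)
                    labels: torch.Tensor) -> torch.Tensor:
    """Duration-aware Matryoshka sketch: each crop length supervises the
    prefix of matching size, so the first d dimensions remain usable as a
    standalone embedding for short utterances."""
    total = waveform.new_zeros(())
    for sec, d, clf in zip(CROP_SECONDS, head.dims, head.classifiers):
        crop = waveform[:, : int(sec * SAMPLE_RATE)]  # leading crop (assumed)
        emb = encoder(crop)                           # (batch, 256) embedding
        prefix = F.normalize(emb[:, :d], dim=-1)      # first d dimensions
        # Plain cross-entropy stands in for the margin-based objective a
        # real speaker verification recipe would use (e.g., AAM-softmax).
        total = total + F.cross_entropy(clf(prefix), labels)
    return total / len(CROP_SECONDS)
```

Under this sketch, scoring a short utterance at test time would simply truncate the stored full embedding to the matching prefix and compare with cosine similarity, which is consistent with the abstract's claim of no additional inference cost.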