Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, thereby degrading performance. To investigate this issue, we construct VoxPhrase, a large-scale SDSV corpus automatically segmented from the VoxCeleb dataset. Our analysis shows that text-dependent (TD) enrollment is constrained by duration and yields unstable speaker representations. In contrast, although text-independent (TI) enrollment introduces content mismatch, its representations become more stable as the enrollment duration increases. Accordingly, we propose a hybrid-enrollment neural re-scoring framework that combines TD and TI enrollment and performs frame-level comparison via parallel cross-attention. Experiments on VoxPhrase demonstrate consistent improvements across multiple speaker models.
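To make the hybrid-enrollment idea concrete, the following is a minimal, non-learned sketch of frame-level comparison with two parallel cross-attention branches, one over TD enrollment frames and one over TI enrollment frames. It is not the framework proposed above: the abstract does not specify the architecture, so the identity query/key/value projections, the cosine-based branch score, the fixed fusion weight `td_weight`, and all function names here are assumptions made purely for illustration. A trained re-scoring network would learn these components instead.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom > 0 else 0.0


def cross_attend(queries, keys):
    """For each query frame, return a softmax-weighted average of the key frames.

    Assumption: no learned projections, so keys also serve as values.
    """
    dim = len(keys[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    attended = []
    for q in queries:
        logits = [_dot(q, k) * scale for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        attended.append([
            sum(w * k[d] for w, k in zip(weights, keys)) / total
            for d in range(dim)
        ])
    return attended


def branch_score(test_frames, enroll_frames):
    """Mean cosine similarity between each test frame and its attended enrollment frame."""
    attended = cross_attend(test_frames, enroll_frames)
    sims = [_cosine(t, a) for t, a in zip(test_frames, attended)]
    return sum(sims) / len(sims)


def hybrid_score(test_frames, td_frames, ti_frames, td_weight=0.5):
    """Fuse the TD and TI branches, which are computed independently (in parallel).

    Assumption: a fixed convex combination stands in for a learned fusion layer.
    """
    td = branch_score(test_frames, td_frames)
    ti = branch_score(test_frames, ti_frames)
    return td_weight * td + (1.0 - td_weight) * ti


if __name__ == "__main__":
    # Toy 2-dimensional "frame embeddings"; real systems use model outputs.
    test = [[1.0, 0.0], [0.9, 0.1]]
    target = hybrid_score(test, [[1.0, 0.05]], [[0.95, 0.0], [1.0, 0.1]])
    impostor = hybrid_score(test, [[0.0, 1.0]], [[0.1, 1.0], [0.0, 0.9]])
    print(f"target={target:.3f} impostor={impostor:.3f}")
```

In this sketch the TD branch sees a short, content-matched enrollment, while the TI branch can be given as many frames as are available, which mirrors the observation that TI representations stabilize with longer enrollment. Keeping the two branches separate until the final fusion lets each contribute its own frame-level evidence.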