When it comes to authentication in speaker verification systems, not all utterances are created equal. Estimating the quality of test utterances is essential to account for varying acoustic conditions. This paper observes that, in addition to the net-speech duration of an utterance, phonetic richness is a key indicator of utterance quality and plays a significant role in accurate speaker verification. Several phonetic-histogram-based formulations of phonetic richness are explored using transcripts obtained from an automatic speech recognition system. The proposed phonetic richness measure is found to be positively correlated with voice authentication scores across evaluation benchmarks. Additionally, combining the proposed measure with net-speech duration helps calibrate the speaker verification scores, yielding a relative EER improvement of 5.8% on the VoxCeleb1 evaluation protocol. The proposed phonetic-richness-based calibration provides a larger benefit for short utterances with repeated words.
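To make the idea of a histogram-based richness measure concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not spell out. It assumes the ASR transcript has already been converted to a list of phoneme labels, and it illustrates two plausible candidates: the number of distinct phonemes, and the Shannon entropy of the phoneme histogram normalized by the inventory size. The function names and the default inventory size of 39 (the ARPAbet set used by CMUdict) are assumptions made for illustration.

```python
import math
from collections import Counter


def distinct_phoneme_count(phonemes):
    """Number of distinct phoneme labels in the utterance."""
    return len(set(phonemes))


def normalized_entropy(phonemes, inventory_size=39):
    """Shannon entropy of the phoneme histogram, scaled to [0, 1].

    inventory_size is an assumed phoneme inventory (39 for ARPAbet);
    dividing by log(inventory_size) maps a uniform histogram over the
    full inventory to 1.0.
    """
    counts = Counter(phonemes)          # the phonetic histogram
    total = sum(counts.values())
    if total == 0 or inventory_size < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(inventory_size)


# Hypothetical phoneme sequences (ARPAbet labels) for two short utterances.
repeated = ["Y", "EH", "S"] * 3                       # "yes yes yes"
varied = ["DH", "AH", "K", "W", "IH", "K", "F", "AA", "K", "S"]  # "the quick fox"

print(distinct_phoneme_count(repeated), normalized_entropy(repeated))
print(distinct_phoneme_count(varied), normalized_entropy(varied))
```

Under either candidate measure, the utterance with repeated words scores lower than the varied one of similar length, which matches the abstract's remark that short utterances with repeated words are where richness-aware calibration matters most.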