This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio, and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, the resulting version mismatch produces meaningless recognition results. In this paper, we describe different version control strategies for speaker recognition systems that have been carefully studied at Google over years of engineering practice. These strategies are categorized into three groups according to how they are deployed in the production environment: device-side deployment, server-side deployment, and hybrid deployment. To compare the strategies with quantitative metrics under various network configurations, we present SpeakerVerSim, an easily extensible Python-based simulation framework for different server-side deployment strategies of speaker recognition systems.
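To make the version mismatch problem concrete, the following is a minimal sketch of a profile store that tags each profile with the version of the model that produced it and refuses to score across versions. It is not the SpeakerVerSim API: the names `Profile`, `VersionMismatchError`, and `score` are hypothetical, and the use of cosine similarity between embeddings as the comparison score is an assumption made only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stored profile, tagged with the model version that produced it."""
    user_id: str
    model_version: int
    embedding: tuple


class VersionMismatchError(Exception):
    """Raised when a profile and a runtime embedding come from different model versions."""


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score(profile, runtime_embedding, runtime_model_version):
    """Compare runtime audio against a stored profile (assumed scoring: cosine similarity).

    Embeddings from different model versions live in unrelated spaces, so a
    score computed across versions would be meaningless. This sketch fails
    loudly instead of returning such a score.
    """
    if profile.model_version != runtime_model_version:
        raise VersionMismatchError(
            f"profile for {profile.user_id} is v{profile.model_version}, "
            f"runtime model is v{runtime_model_version}"
        )
    return cosine_similarity(profile.embedding, runtime_embedding)
```

Raising an error is only one possible policy. A deployed system would more likely re-enroll the user or route the request to a model of the matching version, which is the kind of choice the version control strategies in this paper address.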