This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio, and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, the version mismatch will produce meaningless recognition results. In this paper, we describe different version control strategies for speaker recognition systems that have been carefully studied at Google over years of engineering practice. These strategies are categorized into three groups according to how they are deployed in the production environment: device-side deployment, server-side deployment, and hybrid deployment. To compare different strategies with quantitative metrics under various network configurations, we present SpeakerVerSim, an easily extensible Python-based simulation framework for different server-side deployment strategies of speaker recognition systems.
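The version mismatch problem described above can be illustrated with a minimal sketch. It is not taken from SpeakerVerSim; the names `Profile`, `VersionMismatchError`, and `score` are hypothetical, and cosine similarity over nonzero embeddings is assumed as the comparison function. The sketch tags each stored profile with the model version that produced it and refuses to compare it against a runtime embedding from a different version.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stored profile, tagged with the model version that produced it."""
    model_version: int
    embedding: tuple


class VersionMismatchError(Exception):
    """Raised when a profile and a runtime embedding come from different models."""


def cosine_similarity(a, b):
    # Assumes both vectors are nonzero and have the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def score(profile, runtime_embedding, runtime_model_version):
    # Embeddings from different model versions are not comparable, so raise
    # an error instead of returning a meaningless similarity score.
    if profile.model_version != runtime_model_version:
        raise VersionMismatchError(
            f"profile v{profile.model_version} vs runtime v{runtime_model_version}"
        )
    return cosine_similarity(profile.embedding, runtime_embedding)
```

In this sketch a mismatch is surfaced as an explicit error; how a production system should react (for example, by re-enrolling or by keeping several profile versions) is exactly the strategy question the paper addresses.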