This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio, and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As the technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, the resulting version mismatch produces meaningless recognition results. In this paper, we describe different version control strategies for speaker recognition systems that have been carefully studied at Google over years of engineering practice. These strategies fall into three groups according to how they are deployed in the production environment: device-side deployment, server-side deployment, and hybrid deployment. To compare the strategies with quantitative metrics under various network configurations, we present SpeakerVerSim, an easily extensible Python-based simulation framework for different server-side deployment strategies of speaker recognition systems.
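To make the version mismatch problem concrete, the following is a minimal sketch in which each stored profile is tagged with the version of the model that produced it, and the runtime comparison refuses to score across versions. All names here (`Profile`, `VersionMismatchError`, `score`) are hypothetical illustrations and are not taken from SpeakerVerSim; cosine similarity is assumed as the comparison function purely for the sake of the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A stored user profile, tagged with the model version that produced it.

    Hypothetical structure for illustration only.
    """
    model_version: str
    embedding: tuple


class VersionMismatchError(Exception):
    """Raised when a profile and a runtime embedding come from different models."""


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score(profile, runtime_version, runtime_embedding):
    """Compare runtime audio against a stored profile.

    Scoring is only attempted when both embeddings were produced by the
    same model version; otherwise the score would be meaningless, so an
    error is raised instead.
    """
    if profile.model_version != runtime_version:
        raise VersionMismatchError(
            f"profile is {profile.model_version}, runtime model is {runtime_version}"
        )
    return cosine_similarity(profile.embedding, runtime_embedding)
```

Under this sketch, a caller that catches `VersionMismatchError` would have to decide how to recover, for example by re-enrolling the user or routing the request to a model of the matching version. Choosing among such recovery paths is the kind of decision the version control strategies address.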