Keyword spotting systems often struggle to generalize across a diverse population of speakers with different accents and ages. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems for extracting speaker information, and we experiment with extracting it from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, the proposed approach requires only a 1% increase in the number of parameters, with minimal impact on latency and computational cost, which makes it a practical solution for real-world applications.
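To make the conditioning mechanism concrete, the following is a minimal sketch of a generic FiLM layer in plain Python, not the architecture evaluated here. It assumes a speaker embedding is already available from some speaker recognition system, and that two affine projections map it to a per-channel scale (gamma) and shift (beta). The function names, shapes, and parameter values are all hypothetical and chosen only for illustration.

```python
def linear(x, weight, bias):
    """Affine map; `weight` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel, conditioned on the speaker.

    features:          list of frames, each a list of C channel values
    speaker_embedding: list of D values (assumed output of a speaker model)
    """
    gamma = linear(speaker_embedding, w_gamma, b_gamma)  # per-channel scale
    beta = linear(speaker_embedding, w_beta, b_beta)     # per-channel shift
    # The same gamma and beta are applied to every frame.
    return [[g * x + b for g, b, x in zip(gamma, beta, frame)]
            for frame in features]


# Toy example: 2 frames, 2 channels, 2-dimensional speaker embedding.
features = [[1.0, 1.0], [2.0, 3.0]]
speaker_embedding = [1.0, 2.0]
w_gamma = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical weights: gamma = embedding
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0], [0.0, 0.0]]    # hypothetical weights: beta = bias only
b_beta = [0.5, -0.5]

modulated = film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[1.5, 1.5], [2.5, 5.5]]
```

Because the only added parameters in this sketch are the two small projections, the overhead grows with the embedding size times the number of modulated channels, which is one plausible reason a FiLM-style design can stay within a small parameter budget.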