A multilingual keyword spotting (KWS) system detects spoken keywords across multiple locales. Conventional monolingual KWS approaches do not scale well to multilingual scenarios because of high development and maintenance costs and a lack of resource sharing. To overcome this limitation, we propose two locale-conditioned universal models, one using locale feature concatenation and one using feature-wise linear modulation (FiLM). We compare these models with two baseline methods: locale-specific monolingual KWS and a single universal model trained on all data. Experiments on 10 localized language datasets show that the locale-conditioned models substantially improve accuracy over the baselines across all locales and in different noise conditions. FiLM performed best, reducing the false rejection rate (FRR) by 61% (relative) on average compared with monolingual KWS models of similar size.
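The two locale-conditioning schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the locale list, the one-hot locale encoding, the feature dimensions, and the hand-set scale and shift values are all hypothetical, and in a trained model the FiLM parameters would be learned from the locale rather than fixed in a table.

```python
from typing import Dict, List, Tuple

# Hypothetical locale set, for illustration only.
LOCALES: List[str] = ["en-US", "de-DE", "ja-JP"]


def one_hot(locale: str) -> List[float]:
    """Encode a locale as a one-hot vector (an assumed encoding)."""
    return [1.0 if locale == name else 0.0 for name in LOCALES]


def concat_condition(features: List[float], locale: str) -> List[float]:
    """Locale feature concatenation: append the locale vector to the features."""
    return list(features) + one_hot(locale)


# Hypothetical per-locale (gamma, beta) pairs. A real model would produce
# these with a learned function of the locale; they are hand-set here.
FILM_PARAMS: Dict[str, Tuple[List[float], List[float]]] = {
    "en-US": ([1.0, 1.0], [0.0, 0.0]),
    "de-DE": ([2.0, 0.5], [0.5, -1.0]),
    "ja-JP": ([0.5, 2.0], [-0.5, 1.0]),
}


def film_condition(hidden: List[float], locale: str) -> List[float]:
    """FiLM: scale and shift each hidden feature with locale-dependent values."""
    gamma, beta = FILM_PARAMS[locale]
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]
```

Concatenation leaves the acoustic features unchanged and lets later layers decide how to use the locale vector, whereas FiLM applies a per-feature affine transform (gamma * h + beta) selected by the locale.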