Wake-word detection is built into most smart-home and portable devices. It lets these devices "wake up" when summoned while consuming little power and compute. This paper examines the role of alignment in developing a wake-word system that responds to a generic phrase. We discuss three approaches. The first is alignment-based: the model is trained with frame-wise cross-entropy. The second is alignment-free: the model is trained with connectionist temporal classification (CTC). The third, which we propose, is a hybrid solution in which the model is first trained on a small set of aligned data and then fine-tuned on a sizeable unaligned dataset. We compare the three approaches and evaluate the impact of different aligned-to-unaligned ratios in hybrid training. Our results show that the alignment-free system outperforms the alignment-based one at the target operating point, and that with a small fraction of the data (20%) we can train a model that complies with our initial constraints.
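To make the distinction between the first two training objectives concrete, the following is a minimal pure-Python sketch rather than the system described here. It assumes per-frame log-posteriors are already available as nested lists, that label index 0 is the CTC blank, and that the function names (`framewise_cross_entropy`, `ctc_loss`) are illustrative only. The frame-wise loss needs one label per frame (an alignment), whereas the CTC loss needs only the label sequence and sums over every alignment consistent with it. Under the hybrid recipe, one would presumably minimize the first loss on the aligned subset and then the second on the unaligned data.

```python
import math


def framewise_cross_entropy(log_probs, alignment):
    """Mean negative log-likelihood given one label per frame (alignment required)."""
    assert len(log_probs) == len(alignment)
    total = sum(frame[label] for frame, label in zip(log_probs, alignment))
    return -total / len(alignment)


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_loss(log_probs, targets, blank=0):
    """Negative log-likelihood of `targets`, summed over all CTC alignments."""
    # Extended sequence with blanks interleaved: blank, t1, blank, t2, ..., blank.
    ext = [blank]
    for t in targets:
        ext += [t, blank]
    size = len(ext)

    # alpha[s] = log-probability of all partial paths ending at ext[s].
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new_alpha = [-math.inf] * size
        for s in range(size):
            candidates = [alpha[s]]                # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])    # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])    # skip the blank in between
            new_alpha[s] = _logsumexp(candidates) + frame[ext[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    final = [alpha[-1]] if size == 1 else [alpha[-1], alpha[-2]]
    return -_logsumexp(final)


# Toy input: 2 frames, 3 classes (0 = blank), uniform posteriors.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
aligned_loss = framewise_cross_entropy(uniform, [1, 1])
# Three alignments emit the sequence [1]: (1,1), (blank,1), (1,blank),
# each with probability 1/9, so the CTC loss is -log(3/9) = log 3.
unaligned_loss = ctc_loss(uniform, [1])
```

In practice both losses would be computed by a deep-learning framework on batched tensors; the sketch only illustrates what supervision each objective requires.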