Wake-word detection is built into most smart-home and portable devices. It lets these devices "wake up" when summoned while consuming little power and computation. This paper examines the role of alignment in building a wake-word system that responds to a generic phrase. We discuss three approaches. The first is alignment-based: the model is trained with frame-wise cross-entropy. The second is alignment-free: the model is trained with Connectionist Temporal Classification (CTC). The third, which we propose, is a hybrid in which the model is first trained on a small set of aligned data and then fine-tuned on a large unaligned dataset. We compare the three approaches and evaluate the impact of different aligned-to-unaligned data ratios on hybrid training. Our results show that the alignment-free system outperforms the alignment-based one at the target operating point, and that with a small fraction of the data (20%) we can train a model that meets our initial constraints.
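The difference between the alignment-based and alignment-free objectives named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes per-frame posteriors are already available as plain lists of probabilities, that index 0 is the CTC blank symbol, and the function names `framewise_cross_entropy` and `ctc_loss` are hypothetical. The CTC forward recursion runs in probability space, which is only adequate for very short sequences; a practical system would work in log space.

```python
import math

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def framewise_cross_entropy(probs, alignment):
    """Alignment-based loss: mean negative log-probability of the
    aligned label at each frame. Requires one label per frame."""
    assert len(probs) == len(alignment)
    total = sum(math.log(p[y]) for p, y in zip(probs, alignment))
    return -total / len(probs)


def ctc_loss(probs, labels):
    """Alignment-free loss: negative log-likelihood of `labels`,
    summed over every frame-level path that collapses to `labels`."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    size = len(ext)

    # Forward variables for the first frame.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]              # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skip the intermediate blank only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


if __name__ == "__main__":
    # Two frames, two symbols (blank and one wake-word unit), uniform posteriors.
    posteriors = [[0.5, 0.5], [0.5, 0.5]]
    print("frame-wise CE:", framewise_cross_entropy(posteriors, [1, 1]))
    print("CTC loss:     ", ctc_loss(posteriors, [1]))
```

The frame-wise loss needs a label for every frame, which is what the aligned data provides. The CTC loss needs only the label sequence, so it can be computed on unaligned data. Under the hybrid scheme described in the abstract, a model would first be trained with the former on the small aligned set and then fine-tuned with the latter on the unaligned set.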