The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks scalability. This paper introduces SYNTTS-COMMANDS, a novel, multilingual voice command dataset entirely generated using state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5\% on English and 98\% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Our work directly addresses the data bottleneck in TinyML, providing a practical, scalable foundation for building private, low-latency, and energy-efficient voice interfaces on resource-constrained edge devices.