Detecting occurrences of keywords with keyword spotting (KWS) systems requires thresholding continuous detection scores. Selecting appropriate thresholds is a non-trivial task that typically relies on optimizing performance on a validation dataset. However, such greedy threshold selection often leads to suboptimal performance on unseen data, particularly in varying or noisy acoustic environments and in few-shot settings. In this work, we investigate detection threshold estimation for template-based open-set few-shot KWS using dynamic time warping (DTW) on noisy speech data. To mitigate the performance degradation caused by suboptimal thresholds, we propose a score calibration approach that operates at the embedding level: learned representations are quantized, and a quantization error-based normalization is applied prior to DTW-based scoring and thresholding. Experiments on KWS-DailyTalk with simulated high-frequency radio channels show that the proposed calibration approach simplifies the selection of robust detection thresholds and significantly improves the resulting detection performance.
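The pipeline named in the abstract (quantize the embeddings, normalize using the quantization error, score with DTW, then threshold) can be illustrated with a minimal sketch. The abstract does not specify the codebook, the local distance, or the exact form of the normalization, so everything below is an assumption made for illustration: the function names are hypothetical, the local distance is Euclidean, and the normalization simply divides each utterance's frames by its mean quantization error. It is not the authors' implementation.

```python
import numpy as np


def quantize(frames, codebook):
    """Assign each frame to its nearest codeword.

    Returns the codeword index and the quantization error (Euclidean
    distance to the chosen codeword) for every frame.
    """
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(frames)), idx]


def normalize_by_quantization_error(frames, codebook, eps=1e-8):
    """ASSUMPTION: scale an utterance by its mean quantization error.

    This is a placeholder for the paper's normalization, whose exact
    form is not given in the abstract.
    """
    _, err = quantize(frames, codebook)
    return frames / (err.mean() + eps)


def dtw_cost(a, b):
    """Length-normalized DTW alignment cost between two frame sequences."""
    n, m = len(a), len(b)
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


def detect(template, segment, codebook, threshold):
    """Return (cost, is_hit); lower cost means a better match."""
    t = normalize_by_quantization_error(template, codebook)
    s = normalize_by_quantization_error(segment, codebook)
    cost = dtw_cost(t, s)
    return cost, bool(cost <= threshold)
```

In this sketch a detection is declared when the normalized DTW cost falls at or below the threshold. The intent of normalizing before scoring, as the abstract describes it, is that costs from clean and noisy conditions become more comparable, so that one threshold transfers better to unseen data; whether this particular placeholder normalization achieves that is not something the sketch demonstrates.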