Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recent work has explored audio-text joint embeddings, which let users enroll phrases as text, and has proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often place too much weight on the initial phonemes of an enrollment, causing false triggers when a mismatched (negative) enrollment-query pair shares a prefix (e.g., ``turn the volume up'' vs. ``turn the volume down''). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB), comprising two datasets, POB-Spark and POB-LibriPhrase (POB-LP), that contain mismatched audio-text pairs with shared prefixes, and we propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. EPS alone reduces the equal error rate (EER) on POB-Spark from 64.4\% to 29.3\% and improves POB-LP accuracy from 87.6\% to 96.8\%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added to training, our method achieves the best POB results while incurring the least degradation on prior metrics among the baselines. This degradation is most pronounced on GSC, which contains only one-word commands. We leave mitigating this trade-off to future work.
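For readers unfamiliar with the EER figures quoted above, the following is a minimal sketch of how an equal error rate can be computed from similarity scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration, with label 1 standing for a matching enrollment-query pair and label 0 for a mismatched one (such as a shared-prefix pair).

```python
def equal_error_rate(scores, labels):
    """Approximate the EER for similarity scores (higher = more likely a match).

    labels: 1 for matching enrollment-query pairs, 0 for mismatched pairs.
    Hypothetical helper for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")

    best_gap, eer = float("inf"), None
    # Sweep each observed score as a threshold; a pair is accepted when
    # its score is >= the threshold. The extra +inf threshold rejects all.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in neg) / len(neg)  # false accept rate
        frr = sum(s < t for s in pos) / len(pos)   # false reject rate
        gap = abs(far - frr)
        if gap < best_gap:
            # EER is taken where FAR and FRR are closest to equal.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data (made up): one mismatched pair scores 0.7, above a true match at 0.6.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

In this toy case a single high-scoring mismatched pair is enough to push the EER to one third, which is the kind of effect a shared-prefix negative would have if the scorer over-weights the start of the phrase.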