Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored audio-text joint embeddings, which allow users to enroll phrases as text, and have proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions are often overly biased toward the initial phonemes of an enrolled phrase, causing false triggers when a negative enrollment-query pair shares a prefix (e.g., ``turn the volume up'' vs. ``turn the volume down''). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB), comprising two datasets, POB-Spark and POB-LibriPhrase (POB-LP), of mismatched audio-text pairs with shared prefixes, and we propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. EPS alone reduces the EER on POB-Spark from 64.4\% to 29.3\% and improves POB-LP accuracy from 87.6\% to 96.8\%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added during training, our method achieves the best POB benchmark results among baselines while incurring the least degradation on prior metrics. This degradation is most pronounced on GSC, which contains only one-word commands; we leave mitigating this trade-off to future work.