Device-directed speech detection (DDSD) is the binary classification task of distinguishing queries directed at a voice assistant from side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text, and/or automatic speech recognition (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes that make DDSD systems more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, alongside verbal cues for DDSD. We present different approaches to combining scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when modalities are missing at inference time.
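The two techniques named above, modality dropout during training and score fusion over whichever modalities are present at inference, can be illustrated with a minimal sketch. This is an assumed implementation for clarity, not the paper's architecture: the function names (`modality_dropout`, `late_fusion_score`), the dropout probability, and the equal-weight averaging are all illustrative choices.

```python
import numpy as np

def modality_dropout(embeddings, p_drop=0.3, training=True, rng=None):
    """Randomly zero out whole per-modality embeddings during training
    (illustrative sketch): the downstream fusion classifier then learns
    to make predictions even when a modality is absent at inference."""
    if not training:
        return embeddings
    rng = rng or np.random.default_rng()
    return [np.zeros_like(e) if rng.random() < p_drop else e
            for e in embeddings]

def late_fusion_score(scores, available):
    """Combine per-modality DDSD scores by averaging only over the
    modalities actually available (hypothetical late-fusion scheme)."""
    present = [s for s, ok in zip(scores, available) if ok]
    if not present:
        raise ValueError("at least one modality must be available")
    return sum(present) / len(present)

# Example: acoustic, text/ASR, and prosody embeddings for a batch of 2.
rng = np.random.default_rng(0)
embs = [rng.standard_normal((2, 8)) for _ in range(3)]
dropped = modality_dropout(embs, p_drop=0.3, rng=rng)

# Example: prosody score missing at inference; fuse the remaining two.
fused = late_fusion_score([0.9, 0.7, 0.4], [True, True, False])
```

At inference time a missing modality is simply excluded from the average (or, for embedding-level fusion, passed as zeros), which is the failure mode the dropout scheme trains the model to tolerate.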