Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant, based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder, and using them as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios where only a single frozen LLM is available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
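Since the comparison above is reported in terms of EER, the following is a minimal sketch of how an equal error rate can be computed from detector scores. It is an illustration only, not the evaluation code used in this work: the function name `equal_error_rate`, the label convention (1 = device-directed, 0 = not directed), and the simple threshold sweep are assumptions made for this example.

```python
from typing import List, Tuple


def equal_error_rate(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a binary detector.

    Assumed convention (hypothetical, for illustration):
    label 1 = utterance directed at the assistant, label 0 = not directed.
    An utterance is accepted when its score is >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, best_eer, best_thr = float("inf"), 1.0, candidates[0]
    for thr in candidates:
        # False accept rate: non-directed utterances that were accepted.
        far = sum(s >= thr for s in negatives) / len(negatives)
        # False reject rate: directed utterances that were rejected.
        frr = sum(s < thr for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where FAR and FRR are closest; averaging the
            # two handles the case where they never match exactly.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    eer, thr = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
    print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

With a finite set of scores the false accept and false reject rates rarely coincide exactly, so this sketch reports their average at the closest operating point; interpolating along the ROC curve is a common alternative.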