Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant, based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder, and using them as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios where only a single frozen LLM is available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
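For readers unfamiliar with the metric, the equal error rate is the operating point at which the false accept rate (non-directed speech accepted as device-directed) equals the false reject rate (device-directed speech rejected). The following is a minimal sketch of how an EER could be estimated from classifier scores; it is not the authors' evaluation code, and the function name `equal_error_rate`, the label convention (1 = device-directed, 0 = not), and the toy scores are assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from detection scores (hypothetical helper).

    scores: higher means "more likely device-directed".
    labels: 1 for device-directed, 0 otherwise (assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, best_eer = None, None
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for thr in sorted(set(scores)) + [float("inf")]:
        # False accept rate: negatives scored at or above the threshold.
        far = sum(s >= thr for s in negatives) / len(negatives)
        # False reject rate: positives scored below the threshold.
        frr = sum(s < thr for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # On finite data the two rates rarely cross exactly,
            # so report their mean at the closest point.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy data, invented for illustration only.
separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(separable, overlapping)
```

With perfectly separated scores the estimate is 0, and it grows as the score distributions of the two classes overlap, which is why a lower EER indicates a better detector.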

