Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we investigate whether it is feasible to drop the requirement that users begin each command with a trigger phrase. We approach this task in three ways. First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we use the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features with ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate (EER) improvements of up to 39% over text-only models and up to 61% over audio-only models. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
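The results above are reported as relative EER reductions. As a minimal sketch of how such figures are commonly computed (not the authors' evaluation code), the Python below estimates the EER by sweeping a decision threshold over classifier scores and then compares two EER values. The function names `equal_error_rate` and `relative_reduction`, the label convention (1 = speech directed at the assistant), and all numeric values are assumptions made for illustration only.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER for binary labels and real-valued scores.

    Assumed convention: label 1 = directed at the assistant, label 0 = not.
    A sample is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = labels == 1
    neg = ~pos

    best_gap, eer = np.inf, None
    # Sweep every observed score as a candidate threshold.
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[neg])    # false accept rate
        frr = np.mean(~accept[pos])   # false reject rate
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical scores and labels, chosen only to exercise the functions.
toy_eer = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])

# Hypothetical EER values; these are not results from the paper.
toy_gain = relative_reduction(baseline=0.08, improved=0.06)
```

A relative reduction expresses the gain as a fraction of the baseline error, so the same relative figure can correspond to very different absolute changes in EER depending on how strong the baseline is.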