Conversational intelligence from speech is traditionally built as a cascaded pipeline: voice activity detection, diarization, and transcription, followed by separate NLP models for tasks such as semantic endpointing and named entity recognition (NER). This paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. It achieves this by integrating task-specific tokens into the reference text during ASR model training, which streamlines inference and eliminates the need for separate NLP models. Beyond ASR, we conduct experiments on three tasks: speaker change detection, endpointing, and NER. Experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline on each individual task. Additionally, we demonstrate task transfer learning to a new task within an existing TokenVerse.
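To make the token-integration idea concrete, here is a minimal sketch of how a word-level reference transcript might be augmented with task-specific tokens before Transducer training. The token names (`<sc>`, `<ep>`, `<name>`) and the helper function are illustrative assumptions, not the paper's exact token inventory or implementation.

```python
def augment_reference(words, speaker_changes, endpoints, entities):
    """Interleave hypothetical task tokens into a reference transcript.

    words: list of transcript words
    speaker_changes: word indices where a new speaker starts (<sc>)
    endpoints: word indices after which a semantic endpoint occurs (<ep>)
    entities: dict mapping (start, end) word spans to an entity label,
              emitted as <label> ... </label> around the span
    """
    ent_starts = {s: (e, lab) for (s, e), lab in entities.items()}
    out, open_end, lab = [], None, None
    for i, w in enumerate(words):
        if i in speaker_changes:
            out.append("<sc>")          # speaker change token
        if i in ent_starts:
            open_end, lab = ent_starts[i]
            out.append(f"<{lab}>")      # entity opening token
        out.append(w)
        if open_end is not None and i == open_end:
            out.append(f"</{lab}>")     # entity closing token
            open_end = None
        if i in endpoints:
            out.append("<ep>")          # semantic endpoint token
    return " ".join(out)

# Example: two-speaker exchange with a NER span per turn.
ref = augment_reference(
    words=["hello", "i", "am", "alice", "hi", "alice"],
    speaker_changes={4},
    endpoints={3, 5},
    entities={(3, 3): "name", (5, 5): "name"},
)
# ref == "hello i am <name> alice </name> <ep> <sc> hi <name> alice </name> <ep>"
```

The augmented string then serves as the training target, so a single model learns to emit both words and task tokens; at inference, the tokens are parsed out of the hypothesis, replacing the downstream NLP models of a cascaded pipeline.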