In human-computer conversations, extracting entities such as names, street addresses, and email addresses from speech is a challenging task. In this paper, we study the impact of fine-tuning pre-trained speech encoders on extracting spoken entities in human-readable form directly from speech, without the need for text transcription. We illustrate that such a direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases or the spelling of name entities. In the context of dialog from an enterprise virtual agent, we demonstrate that the 1-step approach outperforms the typical 2-step approach, which first generates lexical transcriptions and then applies text-based entity extraction to identify spoken entities.