This paper revisits speech processing with speech encoders, focusing on extracting entities directly from speech with no intermediate textual representation. In human-computer conversations, extracting entities such as names, postal addresses, and email addresses from speech is a challenging task. We study the impact of fine-tuning pre-trained speech encoders to extract spoken entities in human-readable form directly from speech, without the need for text transcription. We show that this direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases and spellings of entities. On dialogs from an enterprise virtual agent, we demonstrate that the 1-step approach outperforms the typical 2-step cascade, which first generates lexical transcriptions and then applies text-based entity extraction, at identifying spoken entities.