This paper rethinks some aspects of speech processing with speech encoders, specifically the extraction of entities directly from speech without an intermediate textual representation. In human-computer conversations, extracting entities such as names, street addresses, and email addresses from speech is a challenging task. We study the impact of fine-tuning pre-trained speech encoders to extract spoken entities in human-readable form directly from speech, without the need for text transcription. We show that this direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases or spelled-out name entities. In the context of dialog from an enterprise virtual agent, we demonstrate that this 1-step approach outperforms the typical 2-step approach, which first generates lexical transcriptions and then applies text-based entity extraction to identify spoken entities.