Automatic Speech Recognition (ASR) technology transcribes spoken language into text and has considerable applications in the clinical realm, including streamlining medical transcription and integrating with Electronic Health Record (EHR) systems. Challenges persist, however: when transcriptions contain noise, the performance of downstream Natural Language Processing (NLP) models drops significantly. Named Entity Recognition (NER), an essential clinical task, is particularly affected; this degradation is often termed the ASR-NLP gap. Prior work has primarily studied ASR efficiency on clean recordings, leaving a research gap concerning performance in noisy environments. This paper introduces a novel dataset, BioASR-NER, designed to bridge the ASR-NLP gap in the biomedical domain, focusing on extracting adverse drug reactions and entity mentions from the Brief Test of Adult Cognition by Telephone (BTACT) exam. The dataset comprises almost 2,000 clean and noisy recordings. To address the noise challenge, we present an innovative transcript-cleaning method using GPT4 and investigate both zero-shot and few-shot approaches. We further conduct an error analysis that sheds light on the types of errors made by transcription software, the corrections made by GPT4, and the challenges GPT4 faces. This paper aims to foster improved understanding of, and potential solutions for, the ASR-NLP gap, ultimately supporting enhanced healthcare documentation practices.