Automatic Speech Recognition (ASR) converts spoken words into text, facilitating interaction between humans and machines. One of its most common applications is Speech-To-Text (STT) transcription, which simplifies user workflows. In the medical field, STT has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings. However, developing an STT model for the medical domain is challenging because sufficient speech and text datasets are lacking. To address this issue, we propose a medical-domain text correction method that modifies the output text of a general STT system using Vision Language Pre-training (VLP). VLP combines textual and visual information, so the text is corrected based on knowledge drawn from images. Our extensive experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field. We further show that multi-modal understanding of image and text information outperforms single-modal understanding using only text information.