Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art service outperforming industry competitors. While many of Whisper's transcriptions were highly accurate, we found that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences, which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as violence, made-up personal information, or false video-based authority. We further provide hypotheses on why hallucinations occur, uncovering potential disparities due to speech type by health status. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases in downstream applications of speech-to-text models.