This work presents "Pisets", a speech-to-text system for scientists and journalists based on a three-component architecture that improves speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition with Wav2Vec2, false-positive filtering with the Audio Spectrogram Transformer (AST), and final speech recognition with Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly enhance the system's effectiveness. In addition, advanced uncertainty modeling techniques further improve transcription quality. Compared to WhisperX and the standard Whisper model, the proposed approaches ensure robust transcription of long audio recordings across varied acoustic conditions. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
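The three-stage flow described in the abstract (Wav2Vec2 segmentation, AST-based false-positive filtering, Whisper transcription) can be sketched as the following control-flow skeleton. All function names and the segment representation here are illustrative placeholders, not the actual "Pisets" API; the real system wires in the corresponding neural models at each stage.

```python
# Hypothetical sketch of the three-stage pipeline from the abstract.
# Segments are modeled as plain dicts; in the real system each stage
# would call a trained model (Wav2Vec2, AST, Whisper) instead.

def wav2vec2_segments(audio):
    """Stage 1: primary recognition / candidate speech segmentation (Wav2Vec2)."""
    # Placeholder: treat every incoming chunk as a candidate segment.
    return list(audio)

def ast_is_speech(segment):
    """Stage 2: filter false positives with the Audio Spectrogram Transformer."""
    # Placeholder: a real AST classifier would score the spectrogram here.
    return segment.get("is_speech", True)

def whisper_transcribe(segment):
    """Stage 3: final transcription of confirmed speech segments (Whisper)."""
    # Placeholder: a real call would run Whisper decoding on the audio.
    return segment.get("text", "")

def pisets_transcribe(audio):
    """Chain the three stages: segment, filter, then transcribe."""
    candidates = wav2vec2_segments(audio)
    speech_only = [s for s in candidates if ast_is_speech(s)]
    return " ".join(whisper_transcribe(s) for s in speech_only).strip()
```

The key design point this skeleton illustrates is that Whisper only ever sees segments the AST filter has confirmed as speech, which is how the architecture suppresses Whisper's hallucinations on non-speech audio.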