This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It targets two challenges that persist in long-form audio (e.g., meetings, podcasts) despite recent advances in short-form speech recognition: context fragmentation and multi-speaker complexity. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio, unifying Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
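To make the unified generation target concrete, the sketch below shows one plausible serialization in which each generated line carries a timestamp span, a speaker tag, and the transcribed text, so recognition, diarization, and timestamping fall out of a single decoded sequence; it also shows a simple prompt that prepends user-supplied context terms. The line format, the `parse_unified_output` helper, and the prompt template are illustrative assumptions, not the framework's actual interface.

```python
import re
from dataclasses import dataclass

# Hypothetical serialization of the unified end-to-end output: each line
# carries a timestamp span, a speaker tag, and the transcript, so ASR,
# diarization, and timestamping emerge from one generated sequence.
# This format is an assumption for illustration, not VibeVoice-ASR's own.
UNIFIED_OUTPUT = """\
[00:00.00-00:04.20] <spk1> Welcome to the quarterly review meeting.
[00:04.20-00:09.80] <spk2> Thanks. Let's start with the latency numbers.
[00:09.80-00:12.50] <spk1> 好的, the p99 dropped to 120 ms."""

@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    speaker: str
    text: str

_LINE = re.compile(r"\[(\d+):(\d+\.\d+)-(\d+):(\d+\.\d+)\]\s+<(\w+)>\s+(.*)")

def parse_unified_output(raw: str) -> list[Segment]:
    """Parse the hypothetical unified output into structured segments."""
    segments = []
    for line in raw.splitlines():
        m = _LINE.match(line)
        if not m:
            continue
        sm, ss, em, es, spk, text = m.groups()
        segments.append(Segment(
            start=int(sm) * 60 + float(ss),
            end=int(em) * 60 + float(es),
            speaker=spk,
            text=text,
        ))
    return segments

def build_prompt(context_terms: list[str]) -> str:
    """Assemble a hypothetical context-injection prompt: user-supplied
    terminology is prepended so decoding can be biased toward it."""
    return ("Context terms: " + ", ".join(context_terms) + "\n"
            "Transcribe with speaker labels and timestamps:")

if __name__ == "__main__":
    print(build_prompt(["p99 latency", "VibeVoice"]))
    for seg in parse_unified_output(UNIFIED_OUTPUT):
        print(f"{seg.speaker} [{seg.start:7.2f}-{seg.end:7.2f}s] {seg.text}")
```

Note the code-switched third line: because speaker, time, and text share one output stream, a mid-utterance language change needs no segmentation or per-chunk language flag under this scheme.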