Recent advances in audio-language models have yielded remarkable success on short, segment-level speech tasks. However, real-world applications such as meeting transcription, spoken document understanding, and conversational analysis require robust models capable of processing and reasoning over long-form audio. In this work, we present LongSpeech, a large-scale and extensible benchmark specifically designed to evaluate and advance the capabilities of speech models on long-duration audio. LongSpeech comprises over 100,000 speech segments, each approximately 10 minutes long, with rich annotations for automatic speech recognition (ASR), speech translation, summarization, language detection, speaker counting, content separation, and question answering. We also introduce a reproducible pipeline for constructing long-form speech benchmarks from diverse sources, enabling future extensions. Initial experiments with state-of-the-art models reveal substantial performance gaps: models often specialize in one task at the expense of others and struggle with higher-level reasoning. These findings underscore the challenging nature of the benchmark, which will be made publicly available to the research community.