Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by two problems: evaluations that are not comparable because of mismatched post-processing, and training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps a paper and its code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.
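To illustrate why mismatched post-processing makes scores non-comparable, the following minimal sketch computes word error rate (WER) for the same reference and hypothesis with and without a shared normalization step. It is not SURE's implementation: the function names `normalize`, `word_errors`, and `wer`, and the normalization rules themselves (lowercasing, punctuation stripping, whitespace collapsing), are assumptions chosen only for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str, normalizer=normalize) -> float:
    """WER after applying the same normalizer to both strings."""
    ref_words = normalizer(reference).split()
    hyp_words = normalizer(hypothesis).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return word_errors(ref_words, hyp_words) / len(ref_words)


reference = "Hello, world! It's fine."
hypothesis = "hello world its fine"

shared = wer(reference, hypothesis)                       # both sides normalized
raw = wer(reference, hypothesis, normalizer=lambda s: s)  # no normalization
print(f"WER with shared normalization: {shared:.2f}")
print(f"WER without normalization:     {raw:.2f}")
```

In this toy case the transcript is scored as perfect under the shared normalizer and as entirely wrong without it, although the recognized words are identical. Gaps of this kind between evaluation scripts are what a standardized normalization and scoring protocol is meant to remove.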