People who stutter (PWS) face systemic exclusion in today's voice-driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication. STEAMROLLER employs a three-stage architecture comprising ASR transcription, multi-agent text repair, and speech synthesis. Its core innovation is a collaborative multi-agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrate a clear reduction in word error rate (WER) and strong user satisfaction. Beyond immediate accessibility benefits, fine-tuning ASR on STEAMROLLER-repaired speech yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.
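To make the evaluation metric concrete, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names, the example sentence, and the `collapse_repetitions` rule are assumptions introduced purely for illustration. In particular, `collapse_repetitions` is a deliberately naive stand-in for a text repair stage, not the multi-agent framework described above.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def collapse_repetitions(transcript: str) -> str:
    """Hypothetical toy repair: drop immediately repeated words."""
    out: List[str] = []
    for word in transcript.split():
        if not out or out[-1].lower() != word.lower():
            out.append(word)
    return " ".join(out)


# Invented example sentence, not taken from FluencyBank.
reference = "i want to go home"
raw_transcript = "i i want to to go home"
repaired = collapse_repetitions(raw_transcript)

print(word_error_rate(reference, raw_transcript))  # 2 insertions / 5 words -> 0.4
print(word_error_rate(reference, repaired))        # -> 0.0
```

A rule this simple would also delete legitimate repetitions (for example "that that") and cannot handle blocks, prolongations, or mis-transcribed words, which illustrates why transcript repair that preserves semantic intent needs more than pattern matching.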