Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality rather than on end users' communication needs. We introduce Ouvia, an evaluation framework for measuring the user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and a multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems and involving speakers of three English dialects and two genders. We find that modern ST serves people only to a limited extent -- around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.