PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
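The transcript-level INSV-A metrics named above can be illustrated with a minimal sketch. It does not reproduce the release scripts. WER and CER follow the standard edit-distance definition. The abstract does not define Script Fidelity Rate, so the function below assumes it is the share of letters in an ASR transcript that fall in Arabic-script Unicode blocks. That definition, the block list, and all function names are hypothetical.

```python
import unicodedata

# Assumed Arabic-script Unicode blocks (Arabic, Supplement, Extended-A,
# Presentation Forms A and B). The benchmark's actual set is not stated.
ARABIC_RANGES = (
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0x08A0, 0x08FF),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFF),
)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def script_fidelity_rate(transcript):
    """Assumed definition: fraction of letters in Arabic-script blocks."""
    letters = [ch for ch in transcript
               if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    in_script = sum(
        1 for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES)
    )
    return in_script / len(letters)
```

Under this assumed definition, a script check cannot separate Pashto from the Urdu control, because both are written in Arabic script. This is consistent with the abstract reporting audio language identification as a separate signal.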
