We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design that combines a duration-guided model and an alignment-free model through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving a 23% relative reduction in WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only a short reference recording and no additional training. Audio demos are available at https://braskai.github.io/pfluxtts/
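To make the idea of inference-time vector-field fusion concrete, the following is a minimal sketch rather than the system's actual implementation. The abstract does not state the fusion rule, so the convex blend with weight `alpha`, the Euler solver, and the names `fused_euler_sample` and `make_toy_field` are all assumptions introduced for illustration. The two toy fields stand in for the duration-guided and alignment-free decoders.

```python
import numpy as np


def fused_euler_sample(v_guided, v_free, x0, alpha=0.5, steps=32):
    """Integrate dx/dt = alpha * v_guided + (1 - alpha) * v_free from t=0 to t=1.

    The convex blend is an assumed fusion rule, not one taken from the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # Fuse the two predicted velocities at the current state and time.
        v = alpha * v_guided(x, t) + (1.0 - alpha) * v_free(x, t)
        # Plain Euler step along the fused vector field.
        x = x + dt * v
    return x


def make_toy_field(target):
    """Hypothetical stand-in for a decoder: a straight-line flow toward `target`."""
    target = np.asarray(target, dtype=float)

    def field(x, t):
        # Velocity of the linear path that reaches `target` at t = 1.
        # t never reaches 1 inside the Euler loop, so the division is safe.
        return (target - x) / (1.0 - t)

    return field


if __name__ == "__main__":
    guided = make_toy_field([1.0, 0.0])   # plays the duration-guided decoder
    free = make_toy_field([0.0, 1.0])     # plays the alignment-free decoder
    noise = np.array([0.3, -0.2])         # starting sample
    print(fused_euler_sample(guided, free, noise, alpha=0.7))
```

With these toy fields the fused trajectory ends at the `alpha`-weighted blend of the two targets, which shows how a single weight can trade one decoder's behaviour against the other's at inference time without retraining either model.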