Text-to-Speech (TTS) is inherently a "one-to-many" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a "one-to-many" training strategy that leverages synthetic samples as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitating teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (achieving a 25.8% relative WER reduction) and naturally supports high-quality streaming generation. Audio samples are available at https://belletts.github.io/Belle/.