Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech (TTS), however, high-quality systems still commonly generate an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We argue that this setting raises three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to combine with effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path that uses no pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. A project page with audio demos is available at https://barewave.github.io/.