We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents the model from recovering pitch from the perturbed input, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
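The perturbation described above can be illustrated on an F0 contour alone. The following is a minimal sketch under stated assumptions, not the authors' implementation: the helper name `flatten_and_shift_f0`, the use of the geometric mean as the flat level, the uniform shift range in semitones, and the convention that 0 Hz marks unvoiced frames are all hypothetical choices made for illustration. The abstract does not specify how the flattening or the shift is carried out, and in practice the perturbation would be applied to the audio itself.

```python
import numpy as np


def flatten_and_shift_f0(f0, max_shift_semitones=12.0, rng=None):
    """Return a flat, randomly shifted F0 contour (hypothetical helper).

    f0: 1-D array of F0 values in Hz; 0 marks unvoiced frames (assumed convention).
    max_shift_semitones: half-width of the uniform shift range (assumed value).
    """
    rng = np.random.default_rng() if rng is None else rng
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    flat = np.zeros_like(f0)
    if not voiced.any():
        return flat  # nothing to flatten in a fully unvoiced contour

    # Flatten: replace every voiced frame with the geometric mean of the contour.
    base = np.exp(np.mean(np.log(f0[voiced])))

    # Shift: move the flat level by a random number of semitones.
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    flat[voiced] = base * 2.0 ** (shift / 12.0)
    return flat


# The true contour is kept aside as conditioning; the perturbed one carries
# no usable pitch variation.
true_f0 = np.array([0.0, 100.0, 200.0, 0.0, 400.0])
perturbed_f0 = flatten_and_shift_f0(true_f0, rng=np.random.default_rng(0))
```

In this sketch, the encoder side would see only signals derived from `perturbed_f0`, while `true_f0` is passed to the decoder as conditioning, so the decoder has to rely on the conditioning signal for pitch.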