Autoregressive speech synthesis typically adopts a left-to-right generation order, yet generation order is a modelling choice. We investigate decoding order through the masked diffusion framework, which progressively unmasks positions and allows arbitrary decoding orders during both training and inference. By interpolating between the identity permutation and random permutations, we show that randomness in decoding order affects speech quality. We further compare fixed strategies, such as \texttt{l2r} and \texttt{r2l}, with adaptive ones, such as Top-$K$, finding that fixed-order decoding, including the dominant left-to-right approach, is suboptimal, while adaptive decoding yields better performance. Finally, since masked diffusion requires discrete inputs, we quantise acoustic representations and find that even 1-bit quantisation can support reasonably high-quality speech.
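The adaptive Top-$K$ decoding order mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model is replaced by random per-position logits, and the names (\texttt{decode\_top\_k}, \texttt{MASK}) are hypothetical. At each step, the $K$ positions where the (stand-in) model is most confident are unmasked, in contrast to a fixed \texttt{l2r} sweep.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 8, 16     # sequence length, vocabulary size (illustrative)
MASK = -1        # sentinel id for a still-masked position

def decode_top_k(tokens, k=2):
    """Iteratively unmask the k most confident positions per step.

    `rng.normal` stands in for the model's per-position logits;
    a real decoder would recompute logits from the partial sequence.
    """
    tokens = tokens.copy()
    while (tokens == MASK).any():
        logits = rng.normal(size=(T, V))                  # stand-in model call
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs.max(-1)                              # per-position confidence
        conf[tokens != MASK] = -np.inf                    # never re-pick filled slots
        picked = np.argsort(conf)[-k:]                    # adaptive Top-K order
        tokens[picked] = probs[picked].argmax(-1)         # commit those tokens
    return tokens

print(decode_top_k(np.full(T, MASK), k=2))
```

Fixed orders such as \texttt{l2r} or \texttt{r2l} correspond to replacing the confidence-based \texttt{picked} selection with the next $K$ positions in a predetermined permutation.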