In language-modeling-based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either autoregressively or in parallel, depending on the codebook pattern. In particular, flattening the codebooks is the highest-quality decoding strategy, but it is notoriously slow. To this end, we propose a novel stack-and-delay decoding strategy that improves upon flat-pattern decoding, generating four times faster than vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the quality gap with the flat pattern. The results are corroborated by subjective evaluations, which show that samples generated by the new model are preferred slightly more often than samples generated by the competing model given the same text prompts.
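To make the speed trade-off between codebook patterns concrete, the following is a minimal sketch of the flat and delay patterns as they are commonly described in prior codebook-interleaving work. It is not the stack-and-delay pattern proposed here, which the abstract does not specify. The function names, the choice of 4 codebooks, and the 8-frame sequence are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch only: flat vs. delay codebook patterns.
# Each pattern is a list of decoding steps; each step is the list of
# (frame, codebook) tokens that are predicted in that step.

def flat_pattern(num_codebooks, num_frames):
    """Flat: one token per step, frame by frame, codebook by codebook."""
    return [[(t, k)] for t in range(num_frames) for k in range(num_codebooks)]


def delay_pattern(num_codebooks, num_frames):
    """Delay: step s predicts codebook k of frame s - k, all k in parallel."""
    steps = []
    for s in range(num_frames + num_codebooks - 1):
        step = [(s - k, k) for k in range(num_codebooks) if 0 <= s - k < num_frames]
        steps.append(step)
    return steps


if __name__ == "__main__":
    K, T = 4, 8  # hypothetical: 4 codebooks, 8 frames
    flat = flat_pattern(K, T)
    delay = delay_pattern(K, T)
    print("flat steps: ", len(flat))   # K * T sequential steps
    print("delay steps:", len(delay))  # T + K - 1 sequential steps
```

Under these assumptions, the flat pattern needs K·T sequential steps while the delay pattern needs T + K − 1, so for long sequences the delay pattern is roughly K times faster. A fourfold speedup over flat decoding would be consistent with a 4-codebook setup, although the abstract does not state the number of codebooks.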