Representing symbolic music with compound tokens, in which each token groups several sub-tokens that each encode a distinct musical feature or attribute, has the advantage of reducing sequence length. Although previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results because it may not fully capture the interdependencies among them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens but with low memory usage. The NMT consists of two transformers: a main decoder that models the sequence of compound tokens and a sub-decoder that models the sub-tokens of each compound token. Experimental results show that applying the NMT to compound tokens improves perplexity on various symbolic music datasets and on discrete audio tokens from the MAESTRO dataset.
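
The two-level decoding order described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the function names, and the toy arithmetic "decoders" are all assumptions introduced here for illustration, standing in for the two learned transformers. The sketch only shows the control flow, in which the main decoder runs once per compound token and the sub-decoder then emits that token's sub-tokens one at a time, each conditioned on the sub-tokens already chosen.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical feature names; the abstract does not list the actual sub-token types.
FEATURES = ("beat", "pitch", "duration", "velocity")

CompoundToken = Tuple[int, ...]


def flatten(seq: Sequence[CompoundToken]) -> List[int]:
    """Flattened representation: one sequence position per sub-token."""
    return [sub for token in seq for sub in token]


def nested_decode_step(
    history: Sequence[CompoundToken],
    main_decoder: Callable[[Sequence[CompoundToken]], int],
    sub_decoder: Callable[[int, Sequence[int]], int],
    n_features: int = len(FEATURES),
) -> CompoundToken:
    """Decode one compound token, sub-token by sub-token."""
    # The main decoder summarizes the compound-token history once per step.
    context = main_decoder(history)
    subs: List[int] = []
    for _ in range(n_features):
        # Each sub-token is conditioned on the context and on earlier sub-tokens.
        subs.append(sub_decoder(context, subs))
    return tuple(subs)


# Toy stand-ins for the two transformers (plain arithmetic, not learned models).
def toy_main_decoder(history: Sequence[CompoundToken]) -> int:
    return sum(sum(token) for token in history) % 7


def toy_sub_decoder(context: int, previous_subs: Sequence[int]) -> int:
    return (context + len(previous_subs) + sum(previous_subs)) % 5


def generate(n_tokens: int) -> List[CompoundToken]:
    """Autoregressively generate n_tokens compound tokens."""
    seq: List[CompoundToken] = []
    for _ in range(n_tokens):
        seq.append(nested_decode_step(seq, toy_main_decoder, toy_sub_decoder))
    return seq


if __name__ == "__main__":
    music = generate(3)
    print("compound sequence length: ", len(music))
    print("flattened sequence length:", len(flatten(music)))
```

In this sketch the main decoder sees a sequence that is shorter than the flattened one by a factor of `len(FEATURES)`, while the inner loop runs only over the few sub-tokens of a single compound token. That is the sense in which sub-tokens are decoded autoregressively without processing the full flattened sequence.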