Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
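To make the step-based idea concrete, the following is a minimal sketch of one way such a tokenization could look. It is not the authors' implementation. The abstract does not specify the token vocabulary, so everything here is an assumption made for illustration: the `Note` type, the choice of 4 grid ticks per step, the three per-tick states (`REST`, `ONSET`, `SUSTAIN`), and the `<step>` separator. Each token covers one pitch within one step, and tokens are grouped by step, which gives a sparse view of a piano roll.

```python
from collections import defaultdict
from typing import NamedTuple


class Note(NamedTuple):
    onset: int     # start time in grid ticks
    duration: int  # length in grid ticks
    pitch: int     # MIDI pitch number


# Assumption: one step is one beat, split into 4 sixteenth-note ticks.
TICKS_PER_STEP = 4
REST, ONSET, SUSTAIN = 0, 1, 2


def tokenize_by_step(notes, n_steps):
    """Return one token group per step.

    Each token is (pitch, states), where states holds one value per tick
    inside the step. Only pitches that sound in a step get a token there.
    """
    # rolls[step][pitch] -> list of tick states inside that step
    rolls = [defaultdict(lambda: [REST] * TICKS_PER_STEP) for _ in range(n_steps)]
    for note in notes:
        for tick in range(note.onset, note.onset + note.duration):
            step, offset = divmod(tick, TICKS_PER_STEP)
            if step >= n_steps:
                break  # ignore anything past the requested window
            state = ONSET if tick == note.onset else SUSTAIN
            cell = rolls[step][note.pitch]
            # An onset always wins; a sustain never overwrites an onset.
            if state == ONSET or cell[offset] == REST:
                cell[offset] = state
    return [
        [(pitch, tuple(states)) for pitch, states in sorted(roll.items())]
        for roll in rolls
    ]


def flatten(groups, step_token="<step>"):
    """Serialize the groups into one sequence with an explicit step marker."""
    seq = []
    for group in groups:
        seq.append(step_token)  # emitted even for empty steps
        seq.extend(group)
    return seq


if __name__ == "__main__":
    notes = [Note(0, 4, 60), Note(0, 2, 64), Note(2, 2, 64), Note(4, 8, 67)]
    for token in flatten(tokenize_by_step(notes, n_steps=3)):
        print(token)
```

In this sketch every step contributes exactly one `<step>` marker, even when it contains no notes, so position in the sequence maps to musical time at a fixed rate. In an event-based encoding, by contrast, elapsed time has to be recovered by summing time-shift or duration tokens.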