Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
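To make the idea of one token per pitch per time step concrete, the following is a minimal sketch and not the paper's actual implementation. It assumes several details the abstract does not specify: each step (e.g., a beat) is divided into a fixed number of sub-steps, notes arrive as hypothetical `(start, duration, pitch)` triples measured in sub-steps, and a token is a pitch paired with a per-sub-step pattern of silent/onset/sustain states. The function name `tokenize_by_step` and the state constants are illustrative only.

```python
from collections import defaultdict

# Assumed per-sub-step states; the abstract does not define a token's contents.
SILENT, ONSET, SUSTAIN = 0, 1, 2


def tokenize_by_step(notes, subdivisions=4):
    """Group notes into uniform time steps, one token per (step, pitch).

    notes: iterable of (start, duration, pitch), with start and duration
           given as integer sub-step counts (a hypothetical input format).
    Returns a list with one entry per time step; each entry is a list of
    (pitch, pattern) tokens sorted by pitch. Steps with no sounding pitch
    yield an empty list, which is what makes the encoding sparse.
    """
    cells = defaultdict(lambda: [SILENT] * subdivisions)
    last_tick = 0
    for start, duration, pitch in notes:
        for tick in range(start, start + duration):
            step, offset = divmod(tick, subdivisions)
            state = ONSET if tick == start else SUSTAIN
            pattern = cells[(step, pitch)]
            # Keep an existing onset if an overlapping note would sustain over it.
            if state == ONSET or pattern[offset] == SILENT:
                pattern[offset] = state
        last_tick = max(last_tick, start + duration)

    num_steps = -(-last_tick // subdivisions)  # ceiling division
    grouped = [[] for _ in range(num_steps)]
    for (step, pitch), pattern in sorted(cells.items()):
        grouped[step].append((pitch, tuple(pattern)))
    return grouped


# Example: C4 held for one beat, E4 struck twice, G4 crossing a beat boundary.
example = [(0, 4, 60), (0, 2, 64), (2, 2, 64), (6, 4, 67)]
for index, tokens in enumerate(tokenize_by_step(example)):
    print(index, tokens)
```

Under these assumptions, the two E4 notes inside the first beat collapse into a single token, and the G4 note that crosses a beat boundary contributes one token to each beat it touches, so every group of tokens covers the same amount of musical time.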