A Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation

Pitch and meter are two fundamental musical features in symbolic music generation, and researchers typically choose among encoding methods according to their specific goals. However, the advantages and drawbacks of these encoding methods have rarely been discussed. This paper presents an integrated analysis of how the encoding of two low-level features, pitch and meter, influences the performance of a token-based sequential music generation model. First, the commonly used MIDI number encoding is compared with a less common class-octave encoding. Second, a dense intra-bar metric grid is imposed on the encoded sequence as an auxiliary feature, and grids of different complexity and resolution are compared. For complexity, the single-token approach and the multiple-token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level), 12 (8th-triplet-level), and so on up to 64 (64th-note-grid-level) are compared; for duration resolution, 4, 8, 12, and 16 subdivisions per beat are compared. Each encoding is tested on a separately trained Transformer-XL model for a melody generation task. Measured by the distribution similarity of several objective evaluation metrics to the test dataset, the results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics; finer grids and multiple-token grids improve rhythmic quality but also suffer from over-fitting at an early training stage. The results show a general over-fitting phenomenon in two respects: the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we demonstrate the feasibility of using smaller networks and lower embedding dimensions for the generation task, while raising the concern that such models over-fit easily. The findings can also inform feature engineering in future models.
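
To make the two encodings concrete, the following is a minimal illustrative sketch, not the implementation used in the paper. It assumes the common convention in which MIDI note 60 is C4, a 4/4 bar, and a grid resolution counted in grid cells per bar (so 12 corresponds to 8th-note triplets and 64 to 64th notes). The function names `midi_to_class_octave`, `class_octave_to_midi`, and `grid_position` are hypothetical and chosen only for this example.

```python
from fractions import Fraction


def midi_to_class_octave(midi_number):
    """Split a MIDI note number into a (pitch class, octave) pair."""
    if not 0 <= midi_number <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    pitch_class = midi_number % 12   # 0 = C, 1 = C#, ..., 11 = B
    octave = midi_number // 12 - 1   # assumed convention: MIDI 60 is C4
    return pitch_class, octave


def class_octave_to_midi(pitch_class, octave):
    """Inverse mapping, under the same MIDI 60 = C4 assumption."""
    return (octave + 1) * 12 + pitch_class


def grid_position(onset_beats, grid_resolution, beats_per_bar=4):
    """Return the 0-based grid cell within the bar for a note onset.

    onset_beats is measured in beats from the start of the piece.
    grid_resolution is the number of cells per bar (an assumption of this
    sketch); a resolution of 0 means no grid feature (ablation) and
    yields None.
    """
    if grid_resolution == 0:
        return None
    # Exact rational arithmetic avoids rounding errors on triplet onsets.
    onset_in_bar = Fraction(onset_beats) % beats_per_bar
    return int(onset_in_bar * grid_resolution / beats_per_bar)


if __name__ == "__main__":
    print(midi_to_class_octave(60))   # middle C as (pitch class, octave)
    print(grid_position(1.5, 12))     # onset on the "and" of beat 2, triplet grid
```

With a class-octave encoding, notes that share a pitch class (for example every C) share one component of their representation, whereas plain MIDI numbers treat each of the 128 pitches as an unrelated symbol. How the two components are turned into tokens or embeddings is left open here.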
