When used with deep learning, symbolic music is often paired with language model architectures. To do so, the music must be tokenized, i.e. converted into a sequence of discrete tokens. Several approaches are possible, since music can be composed of simultaneous tracks, each containing simultaneous notes with several attributes. Until now, proposed tokenizations have relied on small vocabularies of tokens describing note attributes and time events, resulting in fairly long token sequences and a sub-optimal use of the embedding space of language models. Recent research has sought to reduce the overall sequence length by merging embeddings or combining tokens. In this paper, we show that Byte Pair Encoding (BPE), a compression technique widely used for natural language, significantly decreases the sequence length while increasing the vocabulary size. By doing so, we leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The source code is shared on GitHub, along with a companion website. Finally, BPE is directly implemented in MidiTok, allowing the reader to easily benefit from this method.
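To illustrate the trade-off described above (shorter sequences in exchange for a larger vocabulary), the following is a minimal sketch of Byte Pair Encoding applied to a sequence of note-attribute tokens. It is not the MidiTok API nor the paper's implementation: the token names (`Pitch_60`, `Vel_100`, `Dur_1`, `Shift_1`), the `|` separator used to name merged tokens, and the helpers `learn_bpe`, `apply_merge`, and `decode` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def apply_merge(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Greedily merge the most frequent adjacent token pair, up to `num_merges` times.

    Returns the list of learned merges and the re-encoded sequences.
    """
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # merging a pair seen only once brings no compression
            break
        new_token = a + "|" + b  # hypothetical naming scheme for merged tokens
        merges.append(((a, b), new_token))
        seqs = [apply_merge(seq, (a, b), new_token) for seq in seqs]
    return merges, seqs


def decode(seq):
    """Expand merged tokens back into the base tokens they were built from."""
    return [base for token in seq for base in token.split("|")]


# Three notes, each described by pitch, velocity, duration, and a time shift.
original = [
    "Pitch_60", "Vel_100", "Dur_1", "Shift_1",
    "Pitch_64", "Vel_100", "Dur_1", "Shift_1",
    "Pitch_60", "Vel_100", "Dur_1", "Shift_1",
]
base_vocab = set(original)

merges, encoded = learn_bpe([original], num_merges=2)
vocab = base_vocab | {new_token for _, new_token in merges}

print(len(original), "->", len(encoded[0]), "tokens")
print(len(base_vocab), "->", len(vocab), "vocabulary entries")
```

In this toy example, two merges halve the sequence length while adding two entries to the vocabulary, and `decode` recovers the original token sequence exactly. Real tokenizers map tokens to integer ids or bytes and learn merges over a whole corpus rather than a single sequence.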