This paper addresses the perceptual task of separating the interacting voices, i.e., monophonic melodic streams, of a polyphonic musical piece. We focus on symbolic music, where notes are explicitly encoded, and model the task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e., notes in a pitch-time space. Our approach builds a graph from a musical piece with one node per note, and separates the melodic trajectories by predicting a link between two notes if they are consecutive in the same voice/stream. This kind of local, greedy prediction is made possible by node embeddings produced by a heterogeneous graph neural network that captures inter- and intra-trajectory information. Furthermore, we propose a new regularization loss that encourages the output to respect the MTT premise of at most one incoming and one outgoing link per node, favouring monophonic (voice) trajectories; this loss function might also be useful in other MTT scenarios. Our approach does not use domain-specific heuristics, scales to longer sequences and a higher number of voices, and can handle complex cases such as voice inversions and overlaps. We reach new state-of-the-art results for the voice separation task in classical music of different styles.
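To make the MTT premise concrete, the following is a minimal sketch, not the loss or decoder used in this work. It assumes that notes are indexed in onset order, that every candidate link `(u, v)` points forward in time (`u < v`), and that link probabilities have already been predicted by some model. The names `degree_penalty` and `greedy_decode`, the hinge-style form of the penalty, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
def degree_penalty(link_probs, num_notes):
    """Illustrative penalty (an assumption, not the paper's loss).

    link_probs maps a candidate link (u, v) to its predicted probability.
    The penalty grows when the total probability mass leaving or entering
    a note exceeds 1, i.e. when a note tends towards several successors
    or several predecessors.
    """
    out_mass = [0.0] * num_notes
    in_mass = [0.0] * num_notes
    for (u, v), p in link_probs.items():
        out_mass[u] += p
        in_mass[v] += p
    excess = sum(max(0.0, m - 1.0) for m in out_mass + in_mass)
    return excess / num_notes


def greedy_decode(link_probs, num_notes, threshold=0.5):
    """Keep the most probable links such that every note has at most one
    incoming and one outgoing link, then read off the voices as chains.

    Assumes u < v for every candidate link, so no cycles can occur.
    """
    has_out = [False] * num_notes
    has_in = [False] * num_notes
    successor = {}
    ranked = sorted(link_probs.items(), key=lambda item: -item[1])
    for (u, v), p in ranked:
        if p < threshold:
            break  # all remaining links are less probable
        if not has_out[u] and not has_in[v]:
            successor[u] = v
            has_out[u] = True
            has_in[v] = True

    # Every note without an incoming link starts a new voice.
    voices = []
    for start in range(num_notes):
        if not has_in[start]:
            voice = [start]
            while voice[-1] in successor:
                voice.append(successor[voice[-1]])
            voices.append(voice)
    return voices
```

For example, with four notes and candidate probabilities `{(0, 2): 0.9, (1, 3): 0.8, (2, 3): 0.6, (0, 3): 0.3}`, `greedy_decode` keeps the links 0→2 and 1→3 and rejects 2→3 because note 3 already has a predecessor, yielding the two voices `[0, 2]` and `[1, 3]`. The penalty for the same input is positive because notes 0 and 3 carry more than one unit of outgoing and incoming probability mass, respectively.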