Music source separation aims to extract individual sound sources from a mixed track. Historically, many methods have relied on supervised learning, which requires labeled data that is sometimes limited in diversity. More recent methods have explored N-shot techniques that use one or more audio samples to guide the separation. However, some of these methods require an audio query at inference time, which makes them less suited to genres with varied timbres and effects. This paper offers a proof of concept for a self-supervised music source separation system that eliminates the need for audio queries at inference time. During training the system remains query-based, but we replace the continuous embedding of the query audio with Vector Quantized (VQ) representations. Trained end-to-end with up to N classes, where N is the VQ codebook size, the model aims to categorize instrument classes effectively. During inference, the input is separated into N sources, some of which may remain unused depending on the instruments present in the mix. This methodology suggests an alternative way to think about source separation across diverse music genres. We provide examples and additional results online.
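To make the role of the VQ codebook concrete, the following is a minimal sketch of how a continuous query embedding could be replaced by its nearest codebook entry. It is an illustration only and not the system described in the abstract: the abstract does not specify the quantizer, so a standard nearest-neighbor lookup under Euclidean distance is assumed here, and the function `quantize`, the two-dimensional embeddings, and the three-entry codebook are all hypothetical.

```python
import math


def quantize(embedding, codebook):
    """Return (index, code vector) of the codebook entry nearest to `embedding`.

    Assumption: nearest-neighbor lookup under Euclidean distance, the usual
    choice for vector quantization; the abstract does not state the metric.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(embedding, codebook[i]),
    )
    return best_index, codebook[best_index]


# Hypothetical codebook with N = 3 entries (one per candidate source class).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical continuous embeddings of three query audio clips.
query_embeddings = [[0.9, 0.1], [0.1, 0.8], [0.8, -0.1]]

# Each query is mapped to a discrete class index instead of a continuous vector.
indices = [quantize(e, codebook)[0] for e in query_embeddings]

# Codebook entries that no query selected correspond to unused source slots.
used = sorted(set(indices))
unused = [i for i in range(len(codebook)) if i not in used]

print(indices)  # class index chosen for each query
print(unused)   # codebook entries left unused by these queries
```

Because the codebook is finite, the N indices can be enumerated at inference time without any query audio, which is the property the abstract relies on. Entries that no input activates play the part of the sources that are left unused for a given mix.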