Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which pitch and timbre representations act as modular building blocks for constructing the melody and instrument identity of a source, and whose collection forms a set of per-instrument latent representations underlying the observed mixture. By manipulating these representations, our model samples mixtures with novel combinations of the pitch and timbre of the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model on both a simple dataset of isolated chords and realistic four-part chorales in the style of J.S. Bach, identify the key components for successful disentanglement, and demonstrate mixture transformation based on source-level attribute manipulation.
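As a schematic illustration of the modular-building-block idea (not the paper's implementation), the sketch below composes a per-source latent by concatenating a pitch and a timbre vector, then swaps timbres across sources to form conditioning for a mixture with a novel pitch-timbre combination. All names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: pitch and timbre latents per source.
D_PITCH, D_TIMBRE, N_SOURCES = 8, 4, 2

# Per-source pitch and timbre latents; in DisMix these would be
# inferred by encoders from the mixture and a source query.
pitch = rng.standard_normal((N_SOURCES, D_PITCH))
timbre = rng.standard_normal((N_SOURCES, D_TIMBRE))

def source_latents(pitch, timbre):
    """Concatenate pitch and timbre into one latent per source.

    The resulting set of per-source latents plays the role of the
    conditioning for the mixture-reconstruction model.
    """
    return np.concatenate([pitch, timbre], axis=-1)

z = source_latents(pitch, timbre)  # shape (N_SOURCES, D_PITCH + D_TIMBRE)

# Attribute manipulation: swap the two sources' timbres while keeping
# their pitches, yielding conditioning for a transformed mixture.
z_swapped = source_latents(pitch, timbre[::-1])

assert z.shape == (N_SOURCES, D_PITCH + D_TIMBRE)
assert np.allclose(z_swapped[0, :D_PITCH], z[0, :D_PITCH])  # pitch kept
assert np.allclose(z_swapped[0, D_PITCH:], z[1, D_PITCH:])  # timbre swapped
```

Because each latent factorizes into independent pitch and timbre slots, recombining them is a pure array operation; the generative model then only needs to decode whatever latent set it is given.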