Recent approaches to music generation rely on disentangled representations, often labeled structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate the disentangled representations learned by a set of music audio models for controllable generation, using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate individual strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between the intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.