Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
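To make the intended use concrete, here is a minimal sketch of how labeled positive/negative concept examples feed a TCAV-style probe. It is illustrative only: the activations and gradients below are random stand-ins (in practice they would come from a music encoder run over ConceptCaps examples with and without a given attribute), and the linear probe uses scikit-learn's LogisticRegression as one plausible choice, not necessarily the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 128

# Hypothetical activations: 100 positives for a concept (e.g. "distorted
# guitar") and 100 negatives, synthetically separated along a random
# direction so the probe has something to find.
direction = rng.normal(size=dim)
pos = rng.normal(size=(100, dim)) + direction
neg = rng.normal(size=(100, dim)) - direction

X = np.vstack([pos, neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# Fit a linear probe; the concept activation vector (CAV) is the unit
# normal of its decision boundary.
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# TCAV scores a class by the fraction of examples whose loss gradient
# has a positive directional derivative along the CAV. Gradients here
# are random placeholders for per-example gradients at the probed layer.
grads = rng.normal(size=(50, dim))
tcav_score = float(np.mean(grads @ cav > 0))
print(f"TCAV score: {tcav_score:.2f}")
```

The quality of the CAV, and hence the TCAV score, hinges on how cleanly the positive and negative sets are separated, which is the property ConceptCaps is designed to provide.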