Prior approaches to lead instrument detection primarily analyze mixture audio; they are limited to coarse classifications and generalize poorly. This paper presents an approach to lead instrument detection in multitrack music audio, built on expertly annotated datasets and a framework that integrates a self-supervised learning model with a track-wise, frame-level attention-based classifier. The attention mechanism dynamically extracts and aggregates track-specific features according to their auditory importance, enabling precise detection across varied instrument types and combinations. Enhanced by track classification and permutation augmentation, our model substantially outperforms existing SVM and CRNN models and remains robust on unseen instruments and in out-of-domain testing. We believe this exploration provides valuable insights for future research on audio content analysis in multitrack music settings.
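
To make the track-wise, frame-level attention and the permutation augmentation more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that per-track frame embeddings (for example, from a self-supervised model) are already available as nested lists, and it scores each track with a plain dot product against a query vector, which is a simplification chosen for illustration. The names `track_attention`, `permute_tracks`, and `query` are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def track_attention(features, query):
    """Attend over tracks independently at every frame.

    features[t][f] is the embedding (list of floats) of track t at frame f.
    query is an assumed scoring vector (learned in a real model).
    Returns weights[f][t] (sums to 1 over tracks) and pooled[f],
    the attention-weighted mix of track embeddings at frame f.
    """
    n_tracks, n_frames, dim = len(features), len(features[0]), len(query)
    weights, pooled = [], []
    for f in range(n_frames):
        scores = [
            sum(q * x for q, x in zip(query, features[t][f]))
            for t in range(n_tracks)
        ]
        w = softmax(scores)
        weights.append(w)
        pooled.append(
            [sum(w[t] * features[t][f][d] for t in range(n_tracks))
             for d in range(dim)]
        )
    return weights, pooled


def permute_tracks(features, lead, rng):
    """Permutation augmentation: shuffle track order and remap lead labels.

    lead[f] is the index of the lead track at frame f (before shuffling).
    """
    perm = list(range(len(features)))
    rng.shuffle(perm)  # perm[new] = old track index
    new_position = {old: new for new, old in enumerate(perm)}
    shuffled = [features[old] for old in perm]
    return shuffled, [new_position[k] for k in lead]


if __name__ == "__main__":
    rng = random.Random(0)
    # 4 tracks, 5 frames, 3-dimensional toy embeddings (random placeholders).
    feats = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
             for _ in range(4)]
    weights, _ = track_attention(feats, query=[0.5, -1.0, 2.0])
    # Track with the highest attention weight at each frame.
    print([max(range(4), key=lambda t: w[t]) for w in weights])
```

Under these assumptions, the track with the largest weight at a frame can be read as the predicted lead for that frame. Shuffling the track order together with the labels is intended to keep a model from tying its prediction to a fixed track position.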