Combining multiple audio features can improve music tagging performance, but common deep learning-based feature fusion methods often lack interpretability. To address this problem, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by mathematically combining base music features, capturing synergistic interactions between them. Because each composite feature is an explicit mathematical expression, the approach provides representational benefits similar to those of deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets demonstrate consistent improvements over state-of-the-art systems across base feature sets at different abstraction levels. Most of the performance gains appear within the first few hundred GP evaluations, indicating that effective feature combinations can be identified under modest search budgets. The top evolved expressions include linear, nonlinear, and conditional forms, and a variety of low-complexity solutions reach top performance, consistent with the parsimony pressure that favors simpler expressions. Further analysis of these composite features reveals which interactions and transformations tend to benefit tagging, offering insights that remain opaque in black-box deep models.
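To make the idea of evolving composite features concrete, the following is a minimal sketch of a GP search over expression trees, not the pipeline used in this work. Everything in it is an assumption made for illustration: the primitive set `PRIMS`, the fitness (absolute correlation with a single binary tag), the parsimony coefficient `alpha`, the mutation-only search, and the synthetic data, in which the tag depends on the product of two base features.

```python
import random
import numpy as np

# Hypothetical primitive set: name -> (arity, function).
PRIMS = {
    "add": (2, np.add),
    "sub": (2, np.subtract),
    "mul": (2, np.multiply),
    "tanh": (1, np.tanh),
    "where": (3, lambda c, a, b: np.where(c > 0, a, b)),  # conditional form
}

def random_tree(n_feats, depth, rng):
    """Grow a random expression; integer leaves index base features."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_feats)
    name = rng.choice(sorted(PRIMS))
    arity = PRIMS[name][0]
    return (name,) + tuple(random_tree(n_feats, depth - 1, rng) for _ in range(arity))

def evaluate(tree, X):
    """Compute the composite feature for every row of X."""
    if isinstance(tree, int):
        return X[:, tree]
    return PRIMS[tree[0]][1](*(evaluate(t, X) for t in tree[1:]))

def size(tree):
    """Number of nodes, used as the complexity measure."""
    return 1 if isinstance(tree, int) else 1 + sum(size(t) for t in tree[1:])

def mutate(tree, n_feats, rng):
    """Replace a randomly chosen subtree with a fresh random one."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(n_feats, 2, rng)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(tree[i], n_feats, rng),) + tree[i + 1:]

def fitness(tree, X, y, alpha=0.01):
    """Stand-in fitness: |correlation with the tag| minus a parsimony penalty."""
    f = evaluate(tree, X)
    if not np.all(np.isfinite(f)) or np.std(f) < 1e-12:
        return -1.0  # reject constant or numerically broken features
    return abs(np.corrcoef(f, y)[0, 1]) - alpha * size(tree)

def evolve(X, y, pop_size=20, generations=10, seed=0):
    """Elitist mutation-only search; a few hundred evaluations in total."""
    rng = random.Random(seed)
    n = X.shape[1]
    # Seed with the raw base features so the result is never worse than them.
    pop = list(range(n)) + [random_tree(n, 3, rng) for _ in range(pop_size - n)]
    for _ in range(generations):
        children = [mutate(t, n, rng) for t in pop]
        ranked = sorted(pop + children, key=lambda t: fitness(t, X, y), reverse=True)
        pop = ranked[:pop_size]
    return pop[0]

# Synthetic data: the tag depends on an interaction between features 0 and 1.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

best = evolve(X, y)
print(best, round(fitness(best, X, y), 3))
```

The evolved result is a nested tuple such as `("mul", 0, 1)`, which can be read directly as a formula over the base features. The `alpha * size(tree)` term plays the role of parsimony pressure: between two expressions with similar correlation, the one with fewer nodes ranks higher.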