This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving, from a set of different performances of the same musical piece, the performance or rendition that matches a description of its style, expressive character, or emotion. We observe that a general-purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes, one to the text encoder and one to the audio encoder, we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE); on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music, and of emotion-enriched word embeddings learnt from emotion-labelled text, in capturing musical expression in a cross-modal setting. Additionally, the interpretable mid-level features provide a route for introducing explainability into the retrieval and downstream recommendation processes.
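The retrieval step in a common text-audio embedding space can be illustrated with a minimal sketch. It assumes the text description and each performance have already been mapped to vectors of the same dimension, and that candidates are ranked by cosine similarity. The function name `rank_performances`, the similarity measure, and the toy vectors are illustrative assumptions, not the encoders, scoring function, or data used in the paper.

```python
import numpy as np

def rank_performances(text_emb, audio_embs):
    """Rank candidate performances against one text query.

    Hypothetical helper: the embeddings are assumed to come from some
    text encoder and audio encoder that share an embedding space.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ t
    # argsort is ascending, so negate the scores to put the best match first.
    return np.argsort(-scores), scores

# Toy data (made up): three renditions of one piece in a 4-d shared space.
audio_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.2, 0.0],
    [0.1, 0.0, 0.0, 1.0],
])
query_emb = np.array([0.0, 0.8, 0.3, 0.0])

order, scores = rank_performances(query_emb, audio_embs)
print("ranking (best first):", order.tolist())
```

Because every candidate is a performance of the same piece, the ranking depends only on how well each embedding captures expressive differences between renditions, which is where the choice of text and audio features matters.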