Self-supervised learning (SSL) offers a powerful way to learn robust, generalizable representations without labeled data. In music, where labeled data is scarce, existing SSL methods typically use generated supervision and multi-view redundancy to create pretext tasks. However, these approaches often produce entangled representations and lose view-specific information. We propose a novel self-supervised multi-view learning framework for audio designed to incentivize separation between private and shared representation spaces. A case study on audio disentanglement in a controlled setting demonstrates the effectiveness of our method.