Autodecompose: A generative self-supervised model for semantic decomposition

We introduce Autodecompose, a novel self-supervised generative model that decomposes data into two semantically independent properties without using any labels: the desired property, which captures a specific aspect of the data (e.g. the voice in an audio signal), and the context property, which aggregates all other information (e.g. the content of the audio signal). Autodecompose uses two complementary augmentations, one that manipulates the context while preserving the desired property and another that manipulates the desired property while preserving the context. The augmented variants of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property while the other embeds the context property. We apply Autodecompose to audio signals to encode the sound source (human voice) and the content. We pre-trained the model on the YouTube and LibriSpeech datasets and fine-tuned it in a self-supervised manner without exposing the labels. Our results show that, using the sound source encoder of the pre-trained Autodecompose, a linear classifier achieves an F1 score of 97.6% in recognizing the voices of 30 speakers using only 10 seconds of labeled samples, compared to 95.7% for supervised models. Additionally, our experiments show that Autodecompose is robust against overfitting even when a large model is pre-trained on a small dataset: a large Autodecompose model pre-trained from scratch on 60 seconds of audio from 3 speakers achieved an F1 score of over 98.5% in recognizing those three speakers in unseen utterances. Finally, we show that the context encoder embeds information about the content of the speech and ignores the sound source information. Sample code for training the model, as well as examples of using the pre-trained models, is available at: https://github.com/rezabonyadi/autodecompose
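
The data flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the pairing of each augmentation with a particular encoder is one plausible reading of the abstract, and every function name below is hypothetical. In the toy, the mean of a signal stands in for the desired property ("voice") and the zero-mean residual stands in for the context ("content"); the encoders and decoder are hand-written stand-ins for learned networks. The point it tries to show is that an encoder which only ever sees inputs whose context has been altered can pass on nothing but the desired property, and vice versa, yet the decoder can still reconstruct the original sample.

```python
from typing import List, Tuple

Signal = List[float]


def mean(x: Signal) -> float:
    return sum(x) / len(x)


# --- Toy augmentations (assumptions for illustration only) ---

def augment_context(x: Signal) -> Signal:
    """Alter the context (sample order) but preserve the desired property (mean)."""
    return list(reversed(x))


def augment_desired(x: Signal, shift: float = 5.0) -> Signal:
    """Alter the desired property (mean) but preserve the context (residual)."""
    return [v + shift for v in x]


# --- Hand-written stand-ins for the two learned encoders and the decoder ---

def encode_desired(x: Signal) -> float:
    """Embeds only what survives augment_context: the mean."""
    return mean(x)


def encode_context(x: Signal) -> Signal:
    """Embeds only what survives augment_desired: the zero-mean residual."""
    m = mean(x)
    return [v - m for v in x]


def decode(z_desired: float, z_context: Signal) -> Signal:
    """Recombines the two embeddings into a reconstruction."""
    return [z_desired + c for c in z_context]


def forward(x: Signal) -> Tuple[Signal, float]:
    """One forward pass: augment, encode with both encoders, decode, score."""
    z_desired = encode_desired(augment_context(x))   # context was altered
    z_context = encode_context(augment_desired(x))   # desired property was altered
    x_hat = decode(z_desired, z_context)
    # Mean squared reconstruction error against the ORIGINAL sample.
    loss = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return x_hat, loss


if __name__ == "__main__":
    x = [1.0, 2.0, 6.0]
    x_hat, loss = forward(x)
    print("reconstruction:", x_hat, "loss:", loss)
```

In a real model the encoders and decoder would be neural networks trained to minimize the reconstruction loss, and the augmentations would be audio-specific; the abstract does not specify them, so none are assumed here.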

