Computational models of syntax are predominantly text-based. Here we propose that basic syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and basic properties of syntax: concatenation. We introduce spontaneous concatenation, a phenomenon in which convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated, without ever accessing input data that contain multiple words. Additionally, networks trained on two words learn to embed words into novel, unobserved word combinations. To our knowledge, this is a previously unreported property of CNNs trained on raw speech in the Generative Adversarial Network (GAN) setting, and it has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution from raw acoustic inputs.