Computational models of syntax are predominantly text-based. Here we propose that the most basic syntactic operations can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and elementary properties of syntax: concatenation. We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated, without ever having access to input data containing multiple words. We replicate this finding in several independently trained models with different hyperparameters and training data. Additionally, networks trained on two words learn to embed words into novel, unobserved word combinations. To our knowledge, this is a previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on raw speech, and it has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution from raw acoustic inputs.