Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and one syntactic structure, the double-object dative, is predominantly associated with Germanic rather than Latinate verbs. From the perspective of language acquisition, however, such etymology-based generalizations raise learnability concerns, since the historical origins of words are presumably inaccessible to ordinary language learners. In this study, we present computational evidence that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature for the corresponding etymological classes. Moreover, our model uncovered previously unrecognized features of the quasi-etymological clusters. Taken together with prior results from Japanese, our findings indicate that the proposed method provides a general, cross-linguistic approach to discovering etymological structure from phonotactic cues in the lexicon.
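As a rough illustration of what clustering words by phonotactic information alone can look like, the sketch below fits a two-component mixture of bigram models with EM. It is not the model used in this study: the plain EM mixture, the function names `bigrams` and `cluster_words`, the smoothing constant `alpha`, and the use of spelling as a stand-in for phonemic transcription are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def bigrams(word):
    """Return the symbol bigrams of a word padded with boundary marks."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def cluster_words(words, n_clusters=2, n_iter=50, alpha=0.1, seed=0):
    """Soft-cluster words with EM on a mixture of bigram multinomials.

    Hypothetical helper for illustration only. Returns, for each word,
    a list of posterior cluster probabilities that sums to 1.
    """
    rng = random.Random(seed)
    feats = [Counter(bigrams(w)) for w in words]
    vocab = sorted({b for f in feats for b in f})

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in words:
        raw = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(raw)
        resp.append([r / total for r in raw])

    for _ in range(n_iter):
        # M-step: smoothed mixing weights and bigram probabilities.
        denom = len(words) + n_clusters * alpha
        weights = [(sum(r[k] for r in resp) + alpha) / denom
                   for k in range(n_clusters)]
        log_prob = []
        for k in range(n_clusters):
            counts = {b: alpha for b in vocab}
            for f, r in zip(feats, resp):
                for b, c in f.items():
                    counts[b] += r[k] * c
            total = sum(counts.values())
            log_prob.append({b: math.log(c / total) for b, c in counts.items()})

        # E-step: posterior cluster probabilities for each word,
        # computed in log space to avoid underflow.
        resp = []
        for f in feats:
            scores = [math.log(weights[k])
                      + sum(c * log_prob[k][b] for b, c in f.items())
                      for k in range(n_clusters)]
            top = max(scores)
            unnorm = [math.exp(s - top) for s in scores]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])
    return resp


if __name__ == "__main__":
    # Toy input: orthography stands in for phonemic transcription (assumption).
    toy = ["give", "tell", "send", "donate", "explain", "contribute"]
    for word, probs in zip(toy, cluster_words(toy)):
        print(word, [round(p, 3) for p in probs])
```

With only a handful of toy words the resulting clusters are not meaningful; the sketch only shows that the procedure needs no etymological labels as input. Any comparison with etymology would be made afterwards, on the discovered clusters.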