Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. By studying cross-lingual colexifications, researchers have gained valuable insights into fields such as psycholinguistics and cognitive science [Jackson et al., 2019]. While several multilingual colexification datasets exist, there is untapped potential in using this information to bootstrap datasets for semantic features across languages. In this paper, we aim to demonstrate how colexifications can be leveraged to create such cross-lingual datasets. We showcase curation procedures that result in a dataset covering 142 languages from 21 language families worldwide. The dataset includes ratings of concreteness and affectiveness, mapped to phonemes and phonological features. We further analyze the dataset along different dimensions to demonstrate the potential of the proposed procedures to facilitate interdisciplinary research in psychology, cognitive science, and multilingual natural language processing (NLP). Based on initial investigations, we observe that i) concepts that are closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/final phonemes are significantly correlated with concreteness/affectiveness within language families, such as /k/ as the initial phoneme in both Turkic and Tai-Kadai, which is correlated with concreteness, and /p/ in Dravidian and Sino-Tibetan, which is correlated with valence; iii) the type-to-token ratio (TTR) of phonemes is positively correlated with concreteness across several language families, while the length of phoneme segments is negatively correlated with concreteness; iv) certain phonological features are negatively correlated with concreteness across languages. The dataset is publicly available online for further research.
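
To make observation iii) concrete, the following is a minimal sketch of how a phoneme type-to-token ratio and its correlation with concreteness could be computed. It is not the procedure used to build the dataset: the function names `phoneme_ttr` and `pearson` and the toy entries are assumptions made for illustration, and the toy ratings are invented so that the example runs, not drawn from the data.

```python
from math import sqrt


def phoneme_ttr(phonemes):
    """Type-to-token ratio: number of distinct phonemes / total phonemes."""
    if not phonemes:
        raise ValueError("empty phoneme sequence")
    return len(set(phonemes)) / len(phonemes)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy entries: (phoneme segments, concreteness rating).
# These values are invented for illustration and are NOT from the dataset.
toy = [
    (["k", "a", "t"], 4.8),
    (["m", "a", "m", "a"], 4.1),
    (["a", "n", "a", "n", "a"], 2.0),
    (["b", "a", "b", "a", "b", "a"], 1.5),
]

ttrs = [phoneme_ttr(segments) for segments, _ in toy]
lengths = [len(segments) for segments, _ in toy]
ratings = [rating for _, rating in toy]

print("TTR vs. concreteness:", pearson(ttrs, ratings))
print("Length vs. concreteness:", pearson(lengths, ratings))
```

Because the toy entries were constructed by hand, the signs they produce carry no evidential weight; the sketch only shows the shape of the computation, which in practice would be run per language family on the released ratings.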
