We present a large-scale comparative study of 242 Latin- and Cyrillic-script languages using subword-based methods. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach uses rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations show that BPE segmentation aligns with morpheme boundaries 95% better than a random baseline across 15 languages (F1 = 0.34 vs. 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001): Romance languages form the tightest cluster (mean distance 0.51), while cross-family pairs are clearly separated (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, and that this variation correlates with phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
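To make the reported Mantel statistic concrete, the following is a minimal Python sketch of a one-sided Mantel permutation test between two symmetric distance matrices. It is an illustration only: the function names (`upper_triangle`, `pearson`, `mantel`) and the toy matrices are hypothetical, the values are not taken from the study, and the study's own implementation may differ (for example in the correlation coefficient or the number of permutations).

```python
import random
from itertools import combinations


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel permutation test."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Relabel rows and columns of dist_b with the same permutation.
        shuffled = [[dist_b[perm[i]][perm[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(flat_a, upper_triangle(shuffled)) >= observed:
            hits += 1
    # Count the observed statistic itself among the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data for four languages (not values from the study).
bpe_dist = [
    [0.00, 0.50, 0.60, 0.90],
    [0.50, 0.00, 0.55, 0.85],
    [0.60, 0.55, 0.00, 0.80],
    [0.90, 0.85, 0.80, 0.00],
]
genetic_dist = [
    [0.0, 0.2, 0.3, 1.0],
    [0.2, 0.0, 0.3, 1.0],
    [0.3, 0.3, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
r, p = mantel(bpe_dist, genetic_dist)
```

Permuting rows and columns together, rather than shuffling individual cells, keeps each permuted matrix a valid distance matrix, which is the reason the Mantel test is used in place of an ordinary correlation test on the pairwise distances.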