This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman's rank correlation and regression analyses. Our results show that BERT embeddings capture compositional semantics better than GloVe vectors, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound-word processing and by offering insights into embedding-based semantic modeling.
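The comparison between embedding-derived and human transparency scores can be illustrated with a minimal sketch. The abstract does not specify how the embedding-derived ST metric is defined, so the version below assumes one plausible operationalization: the mean cosine similarity between a compound's vector and the vectors of its two constituents. The function names, the two-dimensional vectors, and the ratings are all hypothetical placeholders invented for illustration; they are not GloVe or BERT outputs and not values from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def embedding_transparency(compound, modifier, head):
    """ASSUMED metric: mean cosine similarity of the compound with each constituent."""
    return (cosine(compound, modifier) + cosine(compound, head)) / 2


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical items: (compound vector, modifier vector, head vector).
# These toy vectors stand in for real embeddings and carry no empirical meaning.
toy_items = {
    "compound_a": ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    "compound_b": ([1.0, 0.0], [1.0, 1.0], [1.0, 1.0]),
    "compound_c": ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]),
    "compound_d": ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]),
}

# Hypothetical human ST ratings for the same items, in the same order.
toy_human_st = [6.5, 3.0, 5.0, 1.5]

model_st = [embedding_transparency(c, m, h) for c, m, h in toy_items.values()]
rho = spearman(model_st, toy_human_st)
print(f"Spearman's rho (toy data): {rho:.3f}")
```

With real data, the toy vectors would be replaced by GloVe or BERT representations and the placeholder ratings by the human ST judgments. The rank-based correlation is a natural fit here because it does not assume that cosine similarities and rating scales are linearly related.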