This work studies the semantic representations that BERT learns for compounds, that is, expressions such as sunlight or bodyguard. We build on recent studies that explore semantic information in Transformers at the word level, and we test whether BERT aligns with human semantic intuitions when dealing with expressions (e.g., sunlight) whose overall meaning depends, to varying degrees, on the semantics of the constituent words (sun, light). We leverage a dataset of human judgments on two psycholinguistic measures of compound semantic analysis: lexeme meaning dominance (LMD; quantifying the weight of each constituent toward the compound meaning) and semantic transparency (ST; evaluating the extent to which the compound meaning is recoverable from the constituents' semantics). We show that BERT-based measures moderately align with human intuitions, especially when using contextualized representations, and that LMD is overall more predictable than ST. Contrary to the results reported for 'standard' words, higher, more contextualized layers are the best at representing compound meaning. These findings shed new light on the abilities of BERT in dealing with fine-grained semantic phenomena. Moreover, they can provide insights into how speakers represent compounds.