Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear, resulting in frequent miscommunication. For a language to be both clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of the meanings it can take, and provide two ways to estimate it: one that requires human annotation (using WordNet), and one that does not (using BERT), which makes the latter readily applicable to a large number of languages. We validate these measures by showing that, in six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. $\rho = 0.40$ in English). We then test our main hypothesis, that a word's lexical ambiguity should negatively correlate with its contextual uncertainty, and find significant correlations in all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
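To make the operationalisation concrete, the following is a minimal sketch of lexical ambiguity computed as the Shannon entropy of a word's meanings. It assumes we already have a sense label for each occurrence of a word type; the function name `sense_entropy`, the example words, and the sense counts are hypothetical and chosen only for illustration. It is not the WordNet-based or BERT-based estimator itself, only the entropy calculation that both estimators approximate.

```python
import math
from collections import Counter


def sense_entropy(sense_labels):
    """Shannon entropy (in bits) of the empirical distribution over sense labels."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    # H = sum_s p(s) * log2(1 / p(s)), with p(s) estimated by relative frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical sense annotations (illustrative only, not data from the paper).
bank = ["finance"] * 6 + ["river"] * 2   # two senses, unevenly distributed
oxygen = ["element"] * 8                 # a single sense, so no ambiguity

print(f"bank:   {sense_entropy(bank):.3f} bits")
print(f"oxygen: {sense_entropy(oxygen):.3f} bits")
```

Under this measure, a word with a single meaning has zero ambiguity, and ambiguity grows as a word's occurrences spread more evenly over more meanings.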