Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are "cooler" than others, based on the observation that the intended meaning of phrases in these languages depends more heavily on context. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of plurality and definiteness markers in Chinese noun phrases (NPs) and investigate how predictable their intended meaning is given the context. To this end, we built a corpus of Chinese NPs, each accompanied by its context and by labels indicating its singularity/plurality and definiteness/indefiniteness. Our corpus assessments and analyses suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we trained a bank of computational models, including both classic machine learning models and state-of-the-art pre-trained language models, to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.