In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when they encounter ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context, each paired with disambiguated sentences that represent its possible interpretations. These annotated examples are systematically organized into 3 main categories and 9 subcategories. Our experiments reveal significant fragility in how LLMs handle ambiguity, with behavior that differs substantially from that of humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, are overconfident in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. These findings highlight a fundamental limitation of current LLMs with significant implications for their deployment in real-world applications where linguistic ambiguity is common, and they call for improved approaches to handling uncertainty in language understanding. The dataset and code are publicly available on GitHub: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.