We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth" - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning that requires contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of more than 1,200 meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to resolve disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.