Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.