Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish meaningful semantic relations from genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite reporting similarly high confidence on both (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the full 225K-question CORE dataset, accuracy drops further, to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
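For reference, the Expected Calibration Error reported above is conventionally computed with the standard binned estimator: predictions are grouped into confidence bins, and ECE is the sample-weighted mean absolute gap between each bin's accuracy and its mean confidence. The sketch below illustrates that standard estimator under assumed defaults (10 equal-width bins); the function name and binning choices are illustrative, not the paper's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| across confidence bins.

    confidences: per-prediction confidence scores in [0, 1]
    correct:     per-prediction 0/1 correctness indicators
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Map confidence c to an equal-width bin; c == 1.0 goes in the top bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)   # mean confidence in bin
            acc = sum(ok for _, ok in b) / len(b)  # empirical accuracy in bin
            ece += (len(b) / n) * abs(acc - conf)
    return ece
```

Under this estimator, a model that is 90% confident and 90% accurate scores ECE = 0, while the failure mode described above (high confidence, near-zero accuracy on unrelated pairs) pushes ECE toward the confidence level itself.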