We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks spanning the lexical, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, novel challenges such as discourse particle disambiguation (e.g., ``just''), and a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models perform strongly on the arithmetic aspects of temporal reasoning but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.