BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks spanning the discourse-lexicon, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction as well as novel challenges such as discourse particle disambiguation (e.g., "just"), and it also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models perform strongly on the arithmetic aspects of temporal reasoning but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

