We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks spanning the discourse lexicon, (multi-)sentential, and document levels, for a total of 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, as well as novel challenges such as discourse particle disambiguation (e.g., ``just''), and it also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models perform strongly on the arithmetic aspects of temporal reasoning but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.