Conversational machine reading comprehension (CMRC) aims to answer questions posed over the course of a conversation, and it has become an active research topic in recent years because of its wide range of applications. However, existing CMRC benchmarks assign a single static passage to each conversation, which is inconsistent with real scenarios. As a result, it is hard to evaluate a model's comprehension ability in real scenarios reasonably. To this end, we propose Orca, the first Chinese CMRC benchmark, and further provide zero-shot and few-shot settings to evaluate a model's generalization ability across diverse domains. We collect 831 hot-topic-driven conversations with 4,742 turns in total. Each turn of a conversation is paired with a response-related passage, so that comprehension ability can be evaluated more reasonably. The conversation topics are collected from a social media platform and cover 33 domains, in an effort to stay consistent with real scenarios. Importantly, all answers in Orca are well-annotated natural responses rather than the specific spans or short phrases used in previous datasets. In addition, we implement three strong baselines to tackle the challenge posed by Orca. The results indicate that our CMRC benchmark is highly challenging. Our dataset and checkpoints are available at https://github.com/nuochenpku/Orca.