Topic segmentation and outline generation aim to divide a document into coherent topic sections and generate corresponding subheadings, revealing the document's discourse topic structure. Compared with sentence-level topic structure, paragraph-level topic structure lets readers grasp the overall context of a document quickly and from a higher level, benefiting downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has held back related research and applications. To fill this gap, we build a Chinese paragraph-level topic representation, corpus, and benchmark in this paper. First, we propose a hierarchical paragraph-level topic structure representation with three layers to guide corpus construction. Then, we employ a two-stage human-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS) while maintaining high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation), and we preliminarily verify its usefulness for a downstream task (discourse parsing).