Automatic discourse processing is bottlenecked by data: current discourse formalisms pose highly demanding annotation tasks that involve large taxonomies of discourse relations, which makes them inaccessible to lay annotators. This work instead adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis and seeks to derive QUD structures automatically. QUD views each sentence as an answer to a question triggered in prior context; we therefore characterize relationships between sentences as free-form questions rather than as labels drawn from an exhaustive, fine-grained taxonomy. We develop a first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained on DCQA (Ko et al., 2022), a large crowdsourced question-answering dataset. Human evaluation results show that QUD dependency parsing is possible for language models trained with this crowdsourced, generalizable annotation scheme. We illustrate how our QUD structure differs from RST trees, and we demonstrate the utility of QUD analysis in the context of document simplification. Our findings show that QUD parsing is an appealing alternative for automatic discourse processing.
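To make the notion of a dependency structure of questions concrete, the following is a minimal sketch of how such a structure could be represented. It is not the parser described above. The names `QUDEdge`, `build_qud_structure`, and `path_to_root` are hypothetical, as are the example questions, and the sketch assumes that the first sentence acts as the root and that every later sentence answers exactly one question anchored in an earlier sentence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QUDEdge:
    """One question linking an earlier (anchor) sentence to the sentence answering it."""
    anchor: int    # index of the earlier sentence that triggers the question
    answer: int    # index of the sentence that answers the question
    question: str  # free-form question text, not a label from a fixed taxonomy


def build_qud_structure(num_sentences: int, edges: List[QUDEdge]) -> Dict[int, QUDEdge]:
    """Map each answer sentence to its edge, checking the assumed constraints."""
    by_answer: Dict[int, QUDEdge] = {}
    for edge in edges:
        # Assumption: a question is triggered in prior context, so anchor < answer.
        if not 0 <= edge.anchor < edge.answer < num_sentences:
            raise ValueError(f"anchor must precede answer: {edge}")
        # Assumption: each sentence answers exactly one question.
        if edge.answer in by_answer:
            raise ValueError(f"sentence {edge.answer} already has a question")
        by_answer[edge.answer] = edge
    # Assumption: every sentence except the first (the root) has an incoming question.
    missing = [i for i in range(1, num_sentences) if i not in by_answer]
    if missing:
        raise ValueError(f"sentences without a question: {missing}")
    return by_answer


def path_to_root(structure: Dict[int, QUDEdge], sentence: int) -> List[int]:
    """Follow anchors from a sentence back to the root sentence."""
    path = [sentence]
    while sentence in structure:
        sentence = structure[sentence].anchor
        path.append(sentence)
    return path


# Hypothetical three-sentence document; the questions are illustrative only.
EXAMPLE_EDGES = [
    QUDEdge(anchor=0, answer=1, question="Why was the bridge closed?"),
    QUDEdge(anchor=1, answer=2, question="How long will the repairs take?"),
]
EXAMPLE = build_qud_structure(3, EXAMPLE_EDGES)
```

Because each edge carries a free-form question instead of a relation label, nothing in this representation depends on a predefined relation inventory, which is the contrast with taxonomy-based formalisms that the abstract draws.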