Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses by continuously raising questions and answering them. Automatic parsing of a discourse to produce a QUD structure thus entails a complex question generation task: given a document and an answer sentence, generate a question that satisfies the linguistic constraints of QUD and can be grounded in an anchor sentence in the prior context. These questions are known to be curiosity-driven and open-ended. This work introduces the first framework for the automatic evaluation of QUD parsing, instantiating the theoretical constraints of QUD in a concrete protocol. We present QUDeval, a dataset of fine-grained evaluations of 2,190 QUD questions generated by both fine-tuned systems and LLMs. Using QUDeval, we show that satisfying all constraints of QUD is still challenging for modern LLMs, and that existing evaluation metrics poorly approximate parser quality. Encouragingly, human-authored QUDs are scored highly by our human evaluators, suggesting that there is headroom for further progress in language modeling to improve both QUD parsing and QUD evaluation.