Segmenting text into fine-grained units of meaning is important to a wide range of NLP applications. The default approach of segmenting text into sentences is often insufficient, especially since sentences are usually complex enough to include multiple units of meaning that merit separate treatment in the downstream task. We focus on the task of abstractive proposition segmentation: transforming text into simple, self-contained, well-formed sentences. Several recent works have demonstrated the utility of proposition segmentation with few-shot prompted LLMs for downstream tasks such as retrieval-augmented grounding and fact verification. However, this approach does not scale to large amounts of text and may not always extract all the facts from the input text. In this paper, we first introduce evaluation metrics for the task to measure several dimensions of quality. We then propose a scalable, yet accurate, proposition segmentation model. We model proposition segmentation as a supervised task by training LLMs on existing annotated datasets and show that training yields significantly improved results. We further show that by using the fine-tuned LLMs as teachers for annotating large amounts of multi-domain synthetic distillation data, we can train smaller student models with results similar to the teacher LLMs. We then demonstrate that our technique leads to effective domain generalization, by annotating data in two domains outside the original training data and evaluating on them. Finally, as a key contribution of the paper, we share an easy-to-use API for NLP practitioners.
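To make the task concrete, the following is a minimal sketch of the input/output format of abstractive proposition segmentation, together with a simple set-based precision/recall over proposition lists. The example text, proposition strings, and the exact-match scoring function are all hypothetical illustrations; the paper's actual metrics are more sophisticated (exact string match is only a crude stand-in for semantic equivalence between propositions).

```python
def proposition_precision_recall(predicted, reference):
    """Precision/recall over sets of propositions, using exact string
    match as a crude proxy for semantic equivalence (illustrative only)."""
    pred, ref = set(predicted), set(reference)
    matched = pred & ref
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall

# A complex sentence decomposed into simple, self-contained,
# well-formed propositions (hand-written illustration of the task).
text = "Bella, a golden retriever, loves fetch and was adopted in 2019."
reference = [
    "Bella is a golden retriever.",
    "Bella loves fetch.",
    "Bella was adopted in 2019.",
]
# A hypothetical system output that misses one fact.
predicted = [
    "Bella is a golden retriever.",
    "Bella loves fetch.",
]
p, r = proposition_precision_recall(predicted, reference)
print(p, r)  # → 1.0 0.6666666666666666
```

Note how the example output has perfect precision (every predicted proposition is supported) but imperfect recall (one fact in the input was not extracted), mirroring the paper's observation that few-shot prompted LLMs may not always extract all the facts from the input text.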