Segmenting text into fine-grained units of meaning is important to a wide range of NLP applications. The default approach of segmenting text into sentences is often insufficient, especially since sentences are usually complex enough to include multiple units of meaning that merit separate treatment in the downstream task. We focus on the task of abstractive proposition segmentation: transforming text into simple, self-contained, well-formed sentences. Several recent works have demonstrated the utility of proposition segmentation with few-shot prompted LLMs for downstream tasks such as retrieval-augmented grounding and fact verification. However, this approach does not scale to large amounts of text and may not always extract all the facts from the input text. In this paper, we first introduce evaluation metrics for the task to measure several dimensions of quality. We then propose a scalable, yet accurate, proposition segmentation model. We model proposition segmentation as a supervised task by training LLMs on existing annotated datasets and show that training yields significantly improved results. We further show that by using the fine-tuned LLMs as teachers for annotating large amounts of multi-domain synthetic distillation data, we can train smaller student models with results similar to the teacher LLMs. We then demonstrate that our technique leads to effective domain generalization, by annotating data in two domains outside the original training data and evaluating on them. Finally, as a key contribution of the paper, we share an easy-to-use API for NLP practitioners.
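To make the task concrete, the following is a minimal sketch of the input/output format of abstractive proposition segmentation, together with a simple set-based precision/recall over proposition lists. The example text, proposition strings, and the exact-match scoring function are all hypothetical illustrations; the paper's actual metrics are more sophisticated (exact string match is only a crude stand-in for semantic equivalence between propositions).

```python
def proposition_precision_recall(predicted, reference):
    """Precision/recall over sets of propositions, using exact string
    match as a crude proxy for semantic equivalence (illustrative only)."""
    pred, ref = set(predicted), set(reference)
    matched = pred & ref
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall

# A complex sentence decomposed into simple, self-contained,
# well-formed propositions (hand-written illustration of the task).
text = "Bella, a golden retriever, loves fetch and was adopted in 2019."
reference = [
    "Bella is a golden retriever.",
    "Bella loves fetch.",
    "Bella was adopted in 2019.",
]
# A hypothetical system output that misses one fact.
predicted = [
    "Bella is a golden retriever.",
    "Bella loves fetch.",
]
p, r = proposition_precision_recall(predicted, reference)
print(p, r)  # → 1.0 0.6666666666666666
```

Note how the example output has perfect precision (every predicted proposition is supported) but imperfect recall (one fact in the input was not extracted), mirroring the paper's observation that few-shot prompted LLMs may not always extract all the facts from the input text.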