The Split and Rephrase (SPRP) task, which consists of splitting a complex sentence into a sequence of shorter grammatical sentences while preserving the original meaning, can facilitate the processing of complex texts for humans and machines alike. It is also a valuable testbed for evaluating natural language processing models, as it requires modelling complex grammatical aspects. In this work, we evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics, although they still lag in terms of splitting compliance. Results from two human evaluations further support the conclusions drawn from the automated metrics. We provide a comprehensive study covering prompting variants, domain shift, and fine-tuned pretrained language models of varying parameter sizes and training data volumes, contrasted with both zero-shot and few-shot approaches using instruction-tuned language models. Although the latter were markedly outperformed by fine-tuned models, they may constitute a reasonable off-the-shelf alternative. Our results provide a fine-grained analysis of the potential and limitations of large language models for SPRP: significant improvements are achievable with relatively small amounts of training data and model parameters, while limitations remain for all models on the task.