Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, even though the amount of real training data is held fixed. As a result, a T5-base model trained with our augmentation approach achieves performance competitive with larger models, despite having significantly fewer parameters and using substantially fewer real training examples.
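To make the pairwise idea concrete, the following is a minimal sketch of one plausible reading of the augmentation, not the authors' implementation. The abstract does not specify how contexts are combined or how pairs are selected, so everything here is an assumption made for illustration: the `Example` record, the `pairwise_augment` function, the `pairs_per_example` parameter (standing in for the augmentation scale), the restriction to partners with a different topic, and the blank-line concatenation of the two contexts. Each pair yields two instances that share one combined context but differ in topic and target summary, which is what makes them contrastive.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One topic-controlled summarisation instance (hypothetical schema)."""
    document: str
    topic: str
    summary: str


def pairwise_augment(examples, pairs_per_example, seed=0):
    """Build contrastive instances by pairing contexts from different documents.

    Assumption: for each real example we sample up to `pairs_per_example`
    partners that carry a different topic. Every pair produces two instances
    with the same combined context, one per topic, each keeping the summary
    of the document that topic came from.
    """
    rng = random.Random(seed)
    augmented = []
    for i, a in enumerate(examples):
        # Partners must be other examples whose topic differs from a's topic.
        candidates = [
            b for j, b in enumerate(examples)
            if j != i and b.topic != a.topic
        ]
        k = min(pairs_per_example, len(candidates))
        for b in rng.sample(candidates, k):
            # Shuffle the order so position alone does not reveal the target.
            parts = [a.document, b.document]
            rng.shuffle(parts)
            context = "\n\n".join(parts)
            augmented.append(Example(context, a.topic, a.summary))
            augmented.append(Example(context, b.topic, b.summary))
    return augmented
```

Under these assumptions, raising `pairs_per_example` grows the augmented set while the pool of real examples stays the same, which mirrors the setting in which the abstract reports its scaling results. For instance, three real examples with distinct topics and `pairs_per_example=2` produce twelve augmented instances.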