We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners have only raw outputs (highlights, recaps, or questions) or only raw inputs (articles, dialogues, or paragraphs), but seldom both. This mismatch forces small models to learn from very few examples or to rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR) and training a student to reconstruct inputs from IRs. Outputs can then be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks: document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on corpora generated by the 70B teacher, as well as other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage: generating in-domain sources avoids the mismatch that limits direct synthesis.
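To make the two stages concrete, the following is a minimal sketch of the PbT pipeline under stated assumptions: `call_teacher`, `Student`, the prompt wording, and the availability of both unpaired inputs and unpaired outputs are illustrative placeholders, not details taken from the paper.

```python
# Illustrative sketch of the two-stage PbT pipeline (not the paper's code).
# `call_teacher` stands in for a large teacher LLM; `Student` stands in for
# a small trainable model (e.g., 8B) that learns to invert IR compression.

from dataclasses import dataclass


@dataclass
class Pair:
    input_text: str   # student-reconstructed source (article, dialogue, ...)
    output_text: str  # the original unpaired output (summary, question, ...)


def call_teacher(prompt: str) -> str:
    """Stand-in for a teacher LLM call; plug in an actual client here."""
    raise NotImplementedError


class Student:
    """Stand-in for a small trainable model with a seq2seq interface."""

    def train(self, ir_input_pairs: list[tuple[str, str]]) -> None: ...
    def generate(self, ir: str) -> str: ...


def pbt_synthesize(unpaired_inputs: list[str],
                   unpaired_outputs: list[str],
                   student: Student) -> list[Pair]:
    # Stage 1: the teacher compresses each raw input into a concise
    # intermediate representation (IR); the student learns IR -> input.
    student.train([(call_teacher(f"Compress into a concise IR:\n{x}"), x)
                   for x in unpaired_inputs])

    # Stage 2: compress each raw output into an IR, let the trained student
    # reconstruct an in-domain input, and pair it with that output.
    return [Pair(input_text=student.generate(
                     call_teacher(f"Compress into a concise IR:\n{y}")),
                 output_text=y)
            for y in unpaired_outputs]
```

Because the reconstructed inputs come from a student trained on in-domain IRs, the resulting pairs stay in the target distribution, which is the property the abstract credits for avoiding the domain mismatch of direct synthesis.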