Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art (with equal model size) by a substantial margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.
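To make the reverse-direction idea concrete, the following is a minimal sketch under stated assumptions, not the SynthIE implementation. It assumes the target output structure is a set of (subject, relation, object) triplets, and it builds a prompt asking an LLM to write text expressing exactly those facts. The helper names `build_reverse_prompt` and `make_training_pair`, the prompt wording, and the example triplet are all hypothetical; the prompts actually used are in the linked repository.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def build_reverse_prompt(triplets: List[Triplet]) -> str:
    """Build a text-generation prompt from a target output structure.

    Hypothetical helper: the wording is an illustrative assumption,
    not the prompt used by the authors.
    """
    if not triplets:
        raise ValueError("at least one triplet is required")
    facts = "\n".join(f"({subj}; {rel}; {obj})" for subj, rel, obj in triplets)
    return (
        "Write a short, fluent passage that expresses exactly the "
        "following facts and no others.\n"
        f"Facts:\n{facts}\n"
        "Text:"
    )


def make_training_pair(triplets: List[Triplet], generated_text: str) -> Dict:
    """Pair the LLM-written text (model input) with the triplets it was
    written from (model target), giving one synthetic training example."""
    return {"text": generated_text, "triplets": list(triplets)}


# Illustrative triplet, chosen for this sketch only.
example = [("Marie Curie", "award received", "Nobel Prize in Physics")]
prompt = build_reverse_prompt(example)
```

The text returned by the LLM for `prompt` would then be passed to `make_training_pair`, so that the extraction model is trained in the forward direction (text to triplets) on data produced in the easier reverse direction (triplets to text).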
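The two reported metrics aggregate errors differently, which helps explain why the gains differ. The sketch below assumes predictions and references are sets of (subject, relation, object) triplets per document, that micro-F1 pools counts over all triplets, and that macro-F1 averages per-relation F1 scores. The per-relation grouping and the function names are assumptions made for illustration, not a description of the authors' evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    predicted: List[Set[Triplet]], gold: List[Set[Triplet]]
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over aligned per-document triplet sets."""
    # relation -> [true positives, false positives, false negatives]
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0, 0])
    for pred, ref in zip(predicted, gold):
        for _, rel, _ in pred & ref:
            counts[rel][0] += 1
        for _, rel, _ in pred - ref:
            counts[rel][1] += 1
        for _, rel, _ in ref - pred:
            counts[rel][2] += 1

    # Micro: pool counts across relations, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)

    # Macro: compute F1 per relation, then take the unweighted mean.
    per_relation = [f1_from_counts(*c) for c in counts.values()]
    macro = sum(per_relation) / len(per_relation) if per_relation else 0.0
    return micro, macro


# Toy data: relation "r1" is frequent and correct, "r2" is rare and wrong.
gold = [{("a", "r1", "b"), ("c", "r1", "d")}, {("e", "r2", "f")}]
pred = [{("a", "r1", "b"), ("c", "r1", "d")}, {("e", "r2", "g")}]
micro, macro = micro_macro_f1(pred, gold)
```

In this toy case the frequent relation dominates the pooled counts, so micro-F1 is higher than macro-F1; macro-F1 weights the rare relation equally and is therefore more sensitive to performance on rare relations.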