Large language models (LLMs) show great potential for synthetic data generation. This work demonstrates that useful data can be generated synthetically even for tasks that the LLM cannot solve directly: for problems with structured outputs, an LLM can be prompted to perform the task in the opposite direction, generating plausible text for a given target structure. Exploiting this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters). The models we introduce, SynthIE, outperform existing baselines of comparable size by a substantial margin of 57 and 79 absolute points in micro and macro F1, respectively. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.
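To make the reverse-direction idea concrete, the following is a minimal sketch, not the authors' actual pipeline. It assumes the target structure is a set of (subject, relation, object) triplets and that the LLM is asked to write text expressing them. The prompt wording, the linearization format, and the helper names `linearize`, `build_reverse_prompt`, and `make_training_pair` are all hypothetical and chosen only for illustration; the call to the LLM itself is left out.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed representation of the target structure: (subject, relation, object).
Triplet = Tuple[str, str, str]


def linearize(triplets: Sequence[Triplet]) -> str:
    """Render triplets one per line in a simple, hypothetical text format."""
    return "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)


def build_reverse_prompt(
    triplets: Sequence[Triplet],
    demos: Sequence[Tuple[Sequence[Triplet], str]] = (),
) -> str:
    """Build a structure-to-text prompt: the LLM sees the facts and writes the text.

    `demos` holds optional (triplets, text) demonstrations for few-shot prompting.
    The instruction wording is an illustrative assumption.
    """
    parts: List[str] = [
        "Write a short, fluent text that expresses exactly the facts listed, and no others."
    ]
    for demo_triplets, demo_text in demos:
        parts.append(f"Facts:\n{linearize(demo_triplets)}\nText: {demo_text}")
    # The final block is left open so the LLM completes the text.
    parts.append(f"Facts:\n{linearize(triplets)}\nText:")
    return "\n\n".join(parts)


def make_training_pair(triplets: Sequence[Triplet], generated_text: str) -> Dict[str, str]:
    """Flip the direction back: generated text is the input, the structure is the target."""
    return {"input": generated_text, "target": linearize(triplets)}
```

In this sketch, the text returned by the LLM for `build_reverse_prompt(...)` would be passed to `make_training_pair` together with the triplets it was generated from, yielding a (text, structure) example for finetuning an extraction model in the original, harder direction.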