Large language models (LLMs) bring strong performance but also considerable complexity, so recent work has begun to use them as data generators rather than task inferencers: the LLM produces training data, and a smaller, more affordable task model is trained on it for efficient deployment and inference. However, this approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen, which uses LLMs to generate various kinds of annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. We demonstrate that a task model 1\% the size of the LLM can achieve comparable or better performance than the LLM itself, largely cutting inference and deployment costs. We also show that, when training the task model, data generated from only a few human demonstrations can be as effective as more than 10 times as much human-annotated data, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at \href{https://github.com/HKUNLP/SymGen}{https://github.com/HKUNLP/SymGen}.
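To make the idea of an agreement-based verifier concrete, the following is a minimal sketch, not the released SymGen implementation. It assumes that agreement is measured by executing each sampled candidate and keeping one whose result is shared by the largest group of candidates. The names `select_by_agreement`, `toy_execute`, and `min_votes` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable, Optional, Sequence


def select_by_agreement(
    candidates: Sequence[str],
    execute: Callable[[str], Hashable],
    min_votes: int = 2,
) -> Optional[str]:
    """Return a candidate whose execution result is shared by the most candidates.

    Hypothetical helper: candidates that fail to execute are discarded, and
    None is returned when no result is backed by at least `min_votes` candidates.
    """
    groups = defaultdict(list)
    for program in candidates:
        try:
            result = execute(program)
        except Exception:
            # A candidate that cannot be executed casts no vote.
            continue
        groups[result].append(program)

    if not groups:
        return None

    # Largest group of candidates that agree on the same execution result.
    best = max(groups.values(), key=len)
    if len(best) < min_votes:
        return None
    return best[0]


def toy_execute(expr: str) -> int:
    """Toy executor (assumption): evaluate an arithmetic expression with no builtins."""
    return eval(expr, {"__builtins__": {}}, {})


# Four sampled "programs" for the same input; one is malformed.
samples = ["1 + 2", "2 + 1", "1 - 2", "1 +"]
chosen = select_by_agreement(samples, toy_execute)
```

In this toy run, two candidates agree on the same result, so one of them is kept as the verified output; a real executor would be task-specific (for example, a SQL engine or a Python interpreter).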