Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models on generated data, these approaches generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. We therefore investigate training data generation with diversely attributed prompts (e.g., prompts that specify attributes such as length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, on which we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study of data generation covering bias, diversity, and efficiency, and highlight three key observations: first, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; second, attribute diversity plays a pivotal role in enhancing model performance; third, attributed prompts match the performance of simple class-conditional prompts while using only 5\% of the ChatGPT querying cost that the latter require. We release the generated dataset and the prompts used to facilitate future research. The data and code will be available at \url{https://github.com/yueyu1030/AttrPrompt}.
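To make the contrast between the two prompt types concrete, the following is a minimal sketch, not the authors' implementation. The attribute names `length` and `style` come from the abstract, but the attribute values, the prompt wording, and the helper names `simple_prompt` and `attributed_prompt` are all assumptions chosen for illustration.

```python
import random

# Hypothetical attribute pools (assumed values, not taken from the paper).
ATTRIBUTES = {
    "length": ["under 50 words", "100 to 150 words"],
    "style": ["formal news report", "casual blog post"],
}


def simple_prompt(label):
    """Class-conditional prompt: only the class label varies."""
    return f"Write a news article about {label}."


def attributed_prompt(label, rng):
    """Attributed prompt: sample one value per attribute and append it."""
    chosen = {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
    details = "; ".join(f"{name}: {value}" for name, value in chosen.items())
    return f"Write a news article about {label}. Follow these attributes: {details}."


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same prompts
    print(simple_prompt("sports"))
    for _ in range(3):
        print(attributed_prompt("sports", rng))
```

Because each call samples a fresh combination of attribute values, repeated queries for the same class can produce differently constrained prompts, which is the source of diversity that the abstract attributes to this kind of prompting.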