Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of the real-world data that researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies by training classifiers on the generated synthetic data and measuring their performance on real-world data. While all three strategies improve classifier performance, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude with recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.