The collection and curation of high-quality training data is crucial for developing high-performing text classification models, but it is often costly and time-consuming. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative. However, the effectiveness of LLM-generated synthetic data in supporting model training is inconsistent across classification tasks. To better understand the factors that moderate this effectiveness, we examine how the performance of models trained on synthetic data varies with the subjectivity of the classification. Our results indicate that subjectivity, at both the task level and the instance level, is negatively associated with the performance of models trained on synthetic data. We conclude by discussing the implications of our work for the potential and limitations of leveraging LLMs for synthetic data generation.