Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examined two approaches to diversifying text generation: 1) logit suppression, which discourages the generation of tokens that have already been generated frequently, and 2) temperature sampling, which flattens the token sampling probability distribution. We found that diversification approaches can increase data diversity, but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions: 1) label replacement (LR), which corrects misaligned labels, and 2) out-of-scope filtering (OOSF), which removes instances that are outside the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.
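To make the two diversification approaches concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a toy four-token vocabulary, and the frequency-proportional penalty in `suppress_logits` (including its `penalty` strength parameter) is a hypothetical choice made for illustration only. The helper names `softmax`, `suppress_logits`, and `entropy` are likewise illustrative.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_logits(logits, token_counts, penalty=1.0):
    """Lower each token's logit in proportion to how often it was already generated.

    Assumption: the penalty is linear in the token's relative frequency.
    `penalty` is a hypothetical strength parameter, not a value from the paper.
    """
    total = sum(token_counts.values())
    if total == 0:
        return list(logits)
    return [
        logit - penalty * token_counts.get(token_id, 0) / total
        for token_id, logit in enumerate(logits)
    ]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: token 0 dominates both the logits and the generation history.
logits = [4.0, 2.0, 1.0, 0.5]
counts = Counter({0: 90, 1: 10})

base = softmax(logits)
flattened = softmax(logits, temperature=1.5)
suppressed = softmax(suppress_logits(logits, counts, penalty=3.0))
```

In this toy setting, raising the temperature spreads probability mass across all tokens, while suppression specifically shifts mass away from the token that has already been generated most often. Either change makes less likely tokens easier to sample, which is consistent with the diversity-versus-accuracy trade-off described above.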