Large language models (LLMs) hold immense potential to generate synthetic data of high quality and utility, with applications ranging from downstream model training to practical data utilisation. However, contemporary models, despite their impressive capacities, consistently struggle to produce data that is both coherent and diverse. To address coherency, we introduce contrastive expert guidance, which emphasises the difference between the logit distributions of a fine-tuned and a base language model to ensure domain adherence. To ensure diversity, we supply existing real and synthetic examples to the model as negative prompts. We call this dual-pronged approach to logit reshaping STEER: Semantic Text Enhancement via Embedding Repositioning. STEER operates at inference time and systematically guides LLMs to balance adherence to the data distribution (ensuring semantic fidelity) against deviation from prior synthetic examples or existing real datasets (ensuring diversity and authenticity). This balance is achieved by dynamically moving towards or away from chosen representations in the latent space. STEER improves on previous synthetic data generation techniques, striking a better balance between data diversity and coherency across three distinct tasks: hypothesis generation, toxic and non-toxic comment generation, and commonsense reasoning task generation. We also demonstrate that STEER's hyperparameters allow fine-grained control over the diversity-coherency trade-off, highlighting its versatility.
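The two logit-reshaping steps described above can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the version below is an assumption in the style of contrastive decoding and classifier-free guidance: the function name `steer_like_scores`, the parameters `gamma` and `eta`, and the way the two steps are combined are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def probs(scores):
    """Turn unnormalised scores into a probability distribution."""
    return [math.exp(x) for x in log_softmax(scores)]


def steer_like_scores(finetuned, base, negative, gamma=1.5, eta=0.5):
    """Reshape next-token scores (assumed formulation, not the paper's own).

    finetuned: logits of the domain fine-tuned model for the current prompt
    base:      logits of the base model for the same prompt
    negative:  logits obtained when conditioning on a negative prompt
               (existing real or synthetic examples)
    gamma:     hypothetical weight; > 1 emphasises the fine-tuned/base gap
    eta:       hypothetical weight; > 0 pushes away from the negative prompt
    """
    ft = log_softmax(finetuned)
    bs = log_softmax(base)
    ng = log_softmax(negative)
    # Contrastive expert guidance: extrapolate from base towards fine-tuned.
    guided = [b + gamma * (f - b) for f, b in zip(ft, bs)]
    # Negative prompting: move away from tokens the negative prompt favours.
    return [g + eta * (g - n) for g, n in zip(guided, ng)]


if __name__ == "__main__":
    # Toy three-token vocabulary; the numbers are made up for illustration.
    finetuned = [2.0, 1.0, 0.0]
    base = [1.0, 1.0, 1.0]
    negative = [3.0, 0.0, 0.0]
    print(probs(steer_like_scores(finetuned, base, negative)))
```

Under these assumptions, setting `gamma = 1` and `eta = 0` recovers the fine-tuned model's own log-probabilities. Raising `gamma` sharpens the distribution towards tokens the fine-tuned model prefers over the base model, and raising `eta` lowers the probability of tokens the negative prompt makes likely, which is one way to picture the diversity-coherency trade-off.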