Steering Language Generation: Harnessing Contrastive Expert Guidance and Negative Prompting for Coherent and Diverse Synthetic Data Generation

Large Language Models (LLMs) hold immense potential to generate synthetic data of high quality and utility, with applications ranging from downstream model training to practical data utilisation. However, contemporary models, despite their impressive capacities, consistently struggle to produce data that is both coherent and diverse. To address the coherency issue, we introduce contrastive expert guidance, where the difference between the logit distributions of fine-tuned and base language models is emphasised to ensure domain adherence. To ensure diversity, we use existing real and synthetic examples as negative prompts to the model. We call this dual-pronged approach to logit reshaping STEER: Semantic Text Enhancement via Embedding Repositioning. STEER operates at inference time and systematically guides the LLM to strike a balance between adherence to the data distribution (ensuring semantic fidelity) and deviation from prior synthetic examples or existing real datasets (ensuring diversity and authenticity). This balance is achieved by dynamically moving towards or away from chosen representations in the latent space. STEER outperforms previous synthetic data generation techniques, exhibiting a better balance between data diversity and coherency across three distinct tasks: hypothesis generation, toxic and non-toxic comment generation, and commonsense reasoning task generation. We also demonstrate that STEER allows fine-grained control over the diversity-coherency trade-off via its hyperparameters, highlighting its versatility.
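
The two logit-reshaping ideas described above can be illustrated with a minimal sketch for a single decoding step. The abstract does not give the exact update rule, so the linear combination below is an assumption, as are the function name `steer_logits` and the hyperparameter names `gamma` (strength of contrastive expert guidance) and `eta` (strength of negative prompting). It is not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def softmax(logits):
    """Convert logits (or unnormalised log-probabilities) to probabilities."""
    return np.exp(log_softmax(logits))

def steer_logits(finetuned_logits, base_logits, negative_logits,
                 gamma=1.0, eta=0.5):
    """Reshape next-token scores for one decoding step (assumed form).

    finetuned_logits: fine-tuned (expert) model on the normal prompt.
    base_logits:      base model on the same prompt.
    negative_logits:  fine-tuned model conditioned on a negative prompt
                      built from existing real or synthetic examples.
    gamma, eta:       hypothetical hyperparameter names for this sketch.
    """
    ft = log_softmax(finetuned_logits)
    base = log_softmax(base_logits)
    neg = log_softmax(negative_logits)
    # Contrastive expert guidance: amplify what the expert prefers
    # relative to the base model.
    guided = ft + gamma * (ft - base)
    # Negative prompting: move away from tokens the negative prompt
    # makes more likely.
    return guided - eta * (neg - ft)

# Toy vocabulary of three tokens.
ft_logits = [2.0, 1.0, 0.0]
base_logits = [1.0, 1.0, 1.0]
neg_logits = [3.0, 1.0, 0.0]

reshaped = steer_logits(ft_logits, base_logits, neg_logits, gamma=1.0, eta=0.5)
next_token_probs = softmax(reshaped)
```

With `gamma = 0` and `eta = 0` the function returns the fine-tuned model's own log-probabilities, so under this assumed form the two hyperparameters act as independent dials: raising `gamma` pushes generation towards the domain, and raising `eta` pushes it away from the examples used as negative prompts.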
