Alignment with human preferences is a crucial property of large language models (LLMs). The primary alignment methods at present, RLHF and DPO, are effective but require extensive human annotation, which is expensive. This cost motivates research into annotation-free alignment training. To improve alignment without relying on external annotation, we introduce Latent Distance Guided Alignment Training (LD-Align). The approach aligns the model with a high-quality supervised fine-tuning (SFT) dataset using guidance from a latent space, which is constructed through sample reconstruction, akin to auto-encoding. We then use the distance between sample pairs in the latent space to guide DPO-based alignment training. Extensive experiments and evaluation demonstrate that the proposed method achieves notable alignment.
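The abstract does not specify how the latent distance enters the DPO objective, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that each pair consists of a reference response from the SFT dataset (treated as "chosen") and a model-generated response (treated as "rejected"), that both have already been encoded into latent vectors by some auto-encoder, and that the Euclidean distance between them, clipped to [0, 1], scales the standard per-pair DPO loss. The names `latent_guided_dpo_loss` and `max_distance`, and the weighting scheme itself, are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin)."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


def latent_guided_dpo_loss(z_chosen, z_rejected, logps,
                           beta=0.1, max_distance=1.0):
    """Hypothetical weighting: scale the DPO loss by a clipped latent distance.

    Assumption: pairs that are far apart in the latent space should drive
    larger updates, and pairs that coincide should contribute nothing.
    """
    weight = min(euclidean_distance(z_chosen, z_rejected) / max_distance, 1.0)
    return weight * dpo_loss(*logps, beta=beta)
```

Under this assumed scheme, a model-generated sample whose latent vector coincides with that of the SFT reference receives zero weight, so training effort concentrates on samples that are still far from the dataset in the latent space.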