Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. Preference optimization for language models is particularly challenging, however, because similarity in token space does not imply semantic or behavioral similarity. To address this challenge, we regularize language model preference optimization in latent space. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Because latent representations carry no explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize this latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing the biases GANPO induces with those induced by token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance at only minor computational overhead.
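Since the abstract only sketches the mechanism, the snippet below is a minimal PyTorch illustration of how such an adversarial latent-space regularizer could be attached to a DPO-style offline objective. The discriminator architecture, the non-saturating GAN losses, the use of a single layer's hidden states, and the weight `lambda_latent` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentDiscriminator(nn.Module):
    """Small MLP that scores hidden states: trained to tell reference from policy."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) -> per-token logits (batch, seq_len)
        return self.net(h).squeeze(-1)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()


def adversarial_latent_penalty(disc, policy_hidden, ref_hidden):
    """Adversarial latent-space penalty (non-saturating GAN losses).

    The discriminator learns to separate the frozen reference model's hidden
    states from the policy's; the policy is penalized in proportion to how
    distinguishable its representations are, which plays the role of a
    latent-space regularizer.
    """
    # Discriminator side: detach so no gradients reach either language model.
    real = disc(ref_hidden.detach())
    fake = disc(policy_hidden.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
        + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
    )

    # Policy ("generator") side: gradients flow back into the policy model.
    gen = disc(policy_hidden)
    policy_penalty = F.binary_cross_entropy_with_logits(gen, torch.ones_like(gen))
    return disc_loss, policy_penalty


# Schematic training step, assuming the trainer exposes sequence log-probs and
# hidden states for both models; lambda_latent is a hypothetical weight, and the
# discriminator and policy are updated with separate optimizers (zeroing the
# discriminator's gradients before its own step):
#
#   pref_loss = dpo_loss(pi_c, pi_r, ref_c, ref_r)
#   disc_loss, penalty = adversarial_latent_penalty(disc, policy_h, ref_h)
#   (pref_loss + lambda_latent * penalty).backward()   # updates the policy
#   disc_loss.backward()                               # updates the discriminator
```

In a setup like this the reference model stays frozen, so the adversarial penalty only discourages the policy's internal representations from drifting away from the reference's, analogous to the role a token-level KL term plays in standard preference optimization.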