Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains underexplored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously evaluated performance metrics. This process leads to the discovery of previously unknown and performant preference optimization algorithms. We call the best-performing of these Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
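As an illustration of what adaptively blending logistic and exponential losses could look like, the following minimal sketch mixes the two losses with a sigmoid gate on a scalar preference margin. The gate form, the `temperature` parameter, the scalar `margin` input, and all function names are assumptions made for illustration; the abstract does not specify the exact DiscoPOP formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_loss(margin: float) -> float:
    """-log(sigmoid(margin)), computed as a stable softplus of -margin."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def exponential_loss(margin: float) -> float:
    """exp(-margin); grows quickly when the margin is very negative."""
    return math.exp(-margin)


def blended_loss(margin: float, temperature: float = 1.0) -> float:
    """Hypothetical adaptive blend of the two losses.

    `margin` is assumed to be a scalar score that is positive when the
    preferred response is ranked above the rejected one. The gate moves
    weight from the logistic loss (low margin) to the exponential loss
    (high margin); `temperature` is an assumed knob for how sharply.
    """
    gate = sigmoid(margin / temperature)
    return (1.0 - gate) * logistic_loss(margin) + gate * exponential_loss(margin)


if __name__ == "__main__":
    for m in (-2.0, 0.0, 2.0):
        print(f"margin={m:+.1f}  loss={blended_loss(m):.4f}")
```

At a margin of zero the gate is 0.5, so the result is the plain average of the two losses; a smaller `temperature` makes the switch between them sharper.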