Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.