Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents capable of solving downstream tasks without additional training or planning at test time. While conventional RL optimizes policies for fixed rewards, zero-shot RL requires learning representations that enable immediate adaptation to arbitrary reward functions. As the field matures, the growing diversity of approaches calls for a foundational framework that reconciles different perspectives under a common structure. In this work, we introduce a formal, unified framework for zero-shot RL that allows rigorous comparison across methods. We propose a taxonomy organizing the algorithmic landscape along two axes: representation, distinguishing compositional from direct methods according to how they exploit decompositions of the action-value function; and learning paradigm, differentiating reward-free from pseudo reward-free training. Additionally, we propose a unified view of existing error bounds that decomposes the total error into three primary components: inference, reward, and approximation. This decomposition serves as a foundation for more grounded comparisons of zero-shot methods.
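To make the taxonomy and the bound concrete, here is a minimal illustrative sketch. It assumes a successor-features-style factorization as one instance of a compositional method; the symbols $\phi$, $\psi$, $w_r$, and the $\varepsilon$ terms are placeholders introduced here, not the paper's notation.
\[
Q_r^{\pi}(s,a) \;\approx\; \psi^{\pi}(s,a)^{\top} w_r,
\qquad
\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t}\, \phi(s_t) \,\Big|\, s_0 = s,\ a_0 = a\Big],
\]
so the reward-independent representation $\psi$ can be learned once, and adapting to a reward $r$ with $r(s) \approx \phi(s)^{\top} w_r$ reduces to estimating $w_r$ at test time. On this reading, the unified error bound takes the schematic form
\[
\big\| Q_r^{*} - Q_r^{\hat{\pi}_r} \big\| \;\lesssim\; \varepsilon_{\text{inference}} + \varepsilon_{\text{reward}} + \varepsilon_{\text{approximation}},
\]
where the three terms correspond to the inference, reward, and approximation components named above.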