As machine learning has moved toward leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. To prefetch as much computation as possible, one would attempt to learn a prior over policies for some yet-to-be-determined reward function. Recent work on forward-backward (FB) representation learning has tried exactly this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mappings. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of aiming for optimal control, performs one step of policy improvement. We call our proposed method $\textbf{one-step forward-backward representation learning (one-step FB)}$. Experiments in didactic settings, as well as in $10$ state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5\times$ smaller and improves zero-shot performance by $+24\%$ on average. Our project website is available at https://chongyi-zheng.github.io/onestep-fb.