We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Although significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying Markov decision process (MDP) from video data. We study two settings: one where the observations contain only i.i.d. noise, and a more challenging one where they also contain exogenous noise, that is, non-i.i.d. noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. When only i.i.d. noise is present, we prove upper bounds for temporal contrastive learning and forward modeling, showing that these approaches can learn the latent state and use it for efficient downstream reinforcement learning (RL) with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound showing that the sample complexity of learning from video data can be exponentially worse than that of learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representation learning methods in two visual domains, yielding results that are consistent with our theoretical findings.