Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid-world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long chunks of activation sequences onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and is accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
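To make the core mechanism concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the interface the abstract describes: a base autoregressive model whose residual stream is additively steered by a "controller" vector emitted by a higher-order model, together with a learned termination head that decides when the current controller is released. All module names, dimensions, and the additive-steering choice are assumptions made for illustration only.

```python
# Hedged sketch: controller-steered residual stream with a termination head.
# Everything here (names, sizes, additive intervention) is an assumption for
# illustration, not the paper's actual architecture.
import torch
import torch.nn as nn


class SteeredBlock(nn.Module):
    """One transformer block whose residual stream can be shifted by a controller vector."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, z=None):
        # Additive intervention on the residual stream (one possible choice).
        if z is not None:
            x = x + z
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x


class ControllerHead(nn.Module):
    """Higher-order model head: emits a steering vector z and a termination logit."""

    def __init__(self, d_model: int, d_code: int = 32):
        super().__init__()
        self.to_z = nn.Linear(d_code, d_model)   # decode latent code -> steering vector
        self.term = nn.Linear(d_model, 1)        # learned termination condition

    def forward(self, code, state):
        z = self.to_z(code)                      # temporally-abstract "controller"
        stop_logit = self.term(state)            # when to release the current controller
        return z, stop_logit


if __name__ == "__main__":
    d_model, n_heads = 64, 4
    block = SteeredBlock(d_model, n_heads)
    head = ControllerHead(d_model)
    x = torch.randn(2, 10, d_model)              # (batch, time, d_model) activations
    code = torch.randn(2, 1, 32)                 # latent action code chosen by the higher-order model
    z, stop_logit = head(code, x[:, -1:, :])
    y = block(x, z)                              # residual stream steered by the controller
    print(y.shape, torch.sigmoid(stop_logit).shape)
```

Under this reading, "internal RL" would reinforce the choice of latent code (and hence controller) directly, rather than individual output tokens; the sketch only shows the forward interface, not the training loop.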