Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success across many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token by token can make learning highly inefficient, particularly when rewards are sparse. Here, we show that this problem can be overcome by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid-world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long chunks of activation sequences onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfolds over long timescales and is accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
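To make the abstract's core mechanism concrete, the following is a minimal, purely illustrative sketch of the idea of controllers that modulate a model's residual stream and carry their own termination conditions. All names (`Controller`, `base_step`, `rollout`), dimensions, and the specific gating form are hypothetical assumptions, not the paper's implementation; a linear-sigmoid gate stands in for the learned termination head.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # residual stream width (hypothetical toy size)


def base_step(h):
    """Stand-in for one block of the base autoregressive model:
    a residual update applied to the hidden state h."""
    W = 0.9 * np.eye(D)
    return h + np.tanh(W @ h)


class Controller:
    """A latent temporally-abstract action: a control vector added to the
    residual stream, paired with a termination condition (here a fixed
    linear-sigmoid gate as a stand-in for a learned one)."""

    def __init__(self):
        self.z = 0.5 * rng.normal(size=D)  # control vector injected into activations
        self.w_term = rng.normal(size=D)   # termination head weights

    def act(self, h):
        # The controller's output steers the base model's activations.
        return base_step(h + self.z)

    def should_stop(self, h):
        p = 1.0 / (1.0 + np.exp(-(self.w_term @ h)))  # termination probability
        return p > 0.5


def rollout(controllers, h, max_steps=20):
    """Compose controllers over time: run each one until its own
    termination condition fires, then hand off to the next."""
    trace = []
    for c in controllers:
        for _ in range(max_steps):
            h = c.act(h)
            trace.append(h.copy())
            if c.should_stop(h):
                break
    return h, trace


h0 = np.zeros(D)
h_final, trace = rollout([Controller(), Controller()], h0)
print(h_final.shape, len(trace))
```

In this toy picture, "internal RL" would assign reward to whole controller invocations (entries of the outer loop) rather than to individual low-level steps, which is what allows credit to propagate over the long timescales each controller spans.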