Behavioral Foundation Models (BFMs) have proved successful in producing policies for arbitrary tasks in a zero-shot manner, requiring no test-time training or task-specific fine-tuning. Among the most promising BFMs are those that estimate the successor measure, learned in an unsupervised way from task-agnostic offline data. However, these methods fail to react to changes in the dynamics, making them inefficient under partial observability or when the transition function changes. This hinders the applicability of BFMs in real-world settings, e.g., in robotics, where the dynamics can unexpectedly change at test time. In this work, we demonstrate that the Forward-Backward (FB) representation, one of the methods from the BFM family, cannot distinguish between distinct dynamics, leading to interference among the latent directions that parametrize different policies. To address this, we propose an FB model with a transformer-based belief estimator, which greatly facilitates zero-shot adaptation. We also show that partitioning the policy encoding space into dynamics-specific clusters, aligned with the context-embedding directions, yields an additional gain in performance. These traits allow our method to respond to the dynamics observed during training and to generalize to unseen ones. Empirically, in the changing dynamics setting, our approach achieves up to 2x higher zero-shot returns compared to the baselines for both discrete and continuous tasks.