There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model's implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
翻译:越来越多的研究关注将预测模型改造为类智能体系统,其中最突出的是基于语言模型的AI助手。我们概述了此类模型转化为智能体时可能失效的两个结构性原因。首先,我们讨论了"自我暗示妄想"问题。先前理论研究表明,若训练数据生成过程中智能体依赖隐藏观测变量,模型将无法模仿该智能体行为:隐藏观测变量作为混杂因子,导致模型将自身生成的动作视为非真实观测的证据。其次,我们引入并正式研究了相关的新型局限性——"预测者-策略不一致性"。当模型生成动作序列时,其对生成该动作序列的策略的隐含预测会作为混杂变量。这导致模型在选择动作时仿佛预期后续动作存在次优性,从而产生过度保守的行为。我们证明这两种失效可通过引入环境反馈回路(即基于模型自身动作进行重训练)加以修正。通过使用决策变压器进行的简单实验验证了这两个局限性,且实证结果与我们的概念及形式化分析一致。我们的研究为这些失效模式提供了统一视角,并解释了为何使用在线学习微调离线学习策略能提升其有效性。