Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and of its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its own position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting emphasis from broader environmental structure toward information supporting immediate action selection. Our findings support the view that introspective examination of internal representations is needed, beyond behavioural evaluation alone, to characterise how agents represent and pursue their objectives.
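The behavioural evaluation above compares the agent's trajectories against an optimal policy. A minimal sketch of such a baseline, assuming a 4-connected grid with binary obstacles, is a breadth-first search for the shortest path; `optimal_path_length` and `path_efficiency` are illustrative names for this sketch, not the paper's implementation.

```python
from collections import deque

def optimal_path_length(grid, start, goal):
    """BFS shortest-path length on a 4-connected grid (1 = obstacle).

    Returns the minimum number of steps from start to goal, or None
    if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None

def path_efficiency(grid, start, goal, agent_steps):
    """Ratio of optimal to actual step count (1.0 = the agent was optimal)."""
    opt = optimal_path_length(grid, start, goal)
    if opt is None or agent_steps == 0:
        return None
    return opt / agent_steps
```

An agent that takes 8 steps on a map whose shortest route is 6 steps scores an efficiency of 0.75; sweeping grid size and obstacle density while tracking this ratio yields the kind of difficulty-scaling curves the abstract describes.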