Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world towards a goal state. Behaviourally, we evaluate the agent against optimal policies across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and multi-goal structures. We then use probing methods to decode internal representations of the environment and multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from spatial cues towards immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.
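To make the behavioural comparison against optimal policies concrete, the following is a minimal sketch of how the set of optimal actions in an obstacle grid could be computed with breadth-first search. It is an illustration under assumptions, not the evaluation code used in this work: the grid encoding (`#` for obstacles), the four-action move set, and the function names `distances_to_goal` and `optimal_actions` are all hypothetical.

```python
from collections import deque

# Hypothetical action set: name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distances_to_goal(grid, goal):
    """Breadth-first search outward from the goal.

    Returns a dict mapping each reachable cell to its shortest-path
    length to the goal. Cells marked '#' are treated as obstacles
    (an assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def optimal_actions(grid, position, goal):
    """Actions that move one step closer to the goal along a shortest path."""
    dist = distances_to_goal(grid, goal)
    if position == goal or position not in dist:
        return set()  # already at the goal, or the goal is unreachable
    r, c = position
    return {
        name
        for name, (dr, dc) in MOVES.items()
        if dist.get((r + dr, c + dc)) == dist[position] - 1
    }


# Example: a 3x3 grid with two obstacles; the goal is the bottom-right cell.
grid = [
    "..#",
    ".#.",
    "...",
]
print(optimal_actions(grid, (0, 0), (2, 2)))  # {'down'}
```

An agent's chosen action at each step can then be scored by whether it falls in this set, which gives a per-step measure of agreement with an optimal policy even when several shortest paths exist.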