Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents at the procedural level across settings in which the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we call fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy, controlling for leakage across tasks. To represent agent problem-solving procedures, we develop an emergent vocabulary induction technique that is designed to be maximally compressive, so that it abstracts away surface-level variation, while remaining expressive enough to reveal the quirks in each model's patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and between models related by distillation (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for top-down auditing and evaluation of how agents approach tasks at a procedural level, given their traces. We believe this work has a range of applications that help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis.
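To make the Jensen-Shannon comparison concrete, the following is a minimal sketch of how the divergence between two agents' behavior could be computed once each trajectory has been reduced to a sequence of procedural tokens. It is an illustration under stated assumptions and not the ProcGrep implementation: the token vocabulary (`search`, `read`, `edit`, `test`), the helper names, and the example trajectories are all hypothetical, and the logarithm base (2 here, which bounds the divergence in [0, 1]) is an assumption, since the abstract does not state which base underlies the reported 0.25.

```python
import math
from collections import Counter


def distribution(tokens, vocab):
    """Empirical frequency of each vocabulary token in a trajectory."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    if total == 0:
        raise ValueError("trajectory contains no tokens from the vocabulary")
    return [counts[t] / total for t in vocab]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical procedural vocabulary and two made-up agent trajectories.
VOCAB = ["search", "read", "edit", "test"]
agent_a = ["search", "read", "read", "edit", "test", "edit", "test"]
agent_b = ["read", "edit", "edit", "edit", "test"]

p = distribution(agent_a, VOCAB)
q = distribution(agent_b, VOCAB)
divergence = js_divergence(p, q)
print(f"JSD(agent_a, agent_b) = {divergence:.3f} bits")
```

Because the midpoint distribution is nonzero wherever either input is nonzero, the divergence stays finite even when one agent never uses a token the other relies on. That property makes it a convenient symmetric distance for comparing sparse behavioral distributions.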