As Large Language Models (LLMs) evolve from code generators into collaborative partners for software engineers, our evaluation methods are lagging behind. Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. To bridge this evaluation gap, this paper makes two core contributions. First, we present a foundational taxonomy of desirable agent behaviors for enterprise software engineering, derived from an analysis of 91 sets of user-defined agent rules. This taxonomy defines four key expectations of agent behavior: Adhering to Standards and Processes, Ensuring Code Quality and Reliability, Solving Problems Effectively, and Collaborating with the User. Second, recognizing that these expectations are not static, we introduce the Context-Adaptive Behavior (CAB) Framework. This emerging framework reveals how behavioral expectations shift along two empirically derived axes: the Time Horizon (from immediate needs to future ideals), established through interviews with 15 expert engineers, and the Type of Work (e.g., from enterprise production to rapid prototyping), identified through a prompt analysis of a prototyping agent. Together, these contributions offer a human-centered foundation for designing and evaluating the next generation of AI agents, moving the field's focus from the correctness of generated code toward the dynamics of true collaborative intelligence.