LLM-based agents already operate in production across many industries, yet we lack an understanding of which technical methods make deployments successful. We present Measuring Agents in Production (MAP), the first systematic study of agents in production, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 practitioners working on deployed systems across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what their top development challenges are. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models rather than tuning weights, and 74% depend primarily on human evaluation. Reliability (consistently correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, giving the research community visibility into deployment realities and underexplored research avenues.