LLM-based agents already operate in production across many industries, yet we lack an understanding of which technical methods make deployments successful. We present Measuring Agents in Production (MAP), the first systematic study of production agents built on first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 306 practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and under-explored research avenues.