Can large language model agents develop industry-level mobile applications? We introduce \textbf{SWE-Bench Mobile}, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents -- three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode) -- and find that even the best configuration achieves only a 12\% task success rate. Our analysis reveals that (1) agent design matters as much as model capability -- the same model shows up to a 6$\times$ performance gap across agents, (2) commercial agents consistently outperform open-source alternatives, and (3) simple ``Defensive Programming'' prompts outperform complex ones by 7.4\%. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a \textit{hosted benchmark challenge} to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at \url{https://swebenchmobile.com}.