Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications. Mobile application development remains largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark of 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluating four state-of-the-art code-capable LLMs (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder) yields low end-to-end resolution rates of 3.39% to 5.21%, a significant gap relative to performance on prior benchmarks. Further analysis reveals systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.
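To put the reported percentages in concrete terms, the sketch below converts them into approximate task counts. It assumes every rate is computed over the full set of 384 tasks, which the abstract does not state explicitly. The helper names `implied_count` and `rate_percent` are illustrative only and are not part of the benchmark.

```python
# Assumption: all percentages use the full 384-task benchmark as denominator.
TOTAL_TASKS = 384


def implied_count(percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to `percent` of `total`."""
    return round(percent / 100 * total)


def rate_percent(count: int, total: int = TOTAL_TASKS) -> float:
    """Return `count / total` as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# Resolution-rate range reported in the abstract.
low_resolved = implied_count(3.39)
high_resolved = implied_count(5.21)

# Share of instances needing changes across multiple artifact types.
multi_artifact = implied_count(35.7)

print(f"resolved tasks: {low_resolved} to {high_resolved} of {TOTAL_TASKS}")
print(f"multi-artifact instances: about {multi_artifact} of {TOTAL_TASKS}")
print(f"round trip: {rate_percent(low_resolved)}% to {rate_percent(high_resolved)}%")
```

Under that assumption, the weakest and strongest models resolve roughly 13 and 20 tasks respectively, and about 137 instances involve multi-artifact changes. Converting the counts back to percentages reproduces the reported 3.39% and 5.21%.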