Scaling Mobile Chaos Testing with AI-Driven Test Execution

Juan Marcano, Ashish Samant, Kai Song, Lingchao Chen, Kaelan Mikowicz, Tim Smyth, Mengdie Zhang, Ali Zamani, Arturo Bravo Rovirosa, Sowjanya Puligadda, Srikanth Prodduturi, Mayank Bansal

From arXiv: 10 pages of content, 1 page of citations, 7 figures, 6 tables

Mobile applications in large-scale distributed systems are susceptible to backend service failures, yet traditional chaos engineering approaches do not scale to mobile testing because of the combinatorial explosion of flows, locations, and failure scenarios that need validation. We present an automated mobile chaos testing system that integrates DragonCrawl, an LLM-based mobile testing platform, with uHavoc, a service-level fault injection system. The key insight is that adaptive AI-driven test execution can navigate mobile applications under degraded backend conditions, eliminating the need to manually write test cases for each combination of user flow, city, and failure type. Since Q1 2024, our system has executed over 180,000 automated chaos tests across 47 critical flows in Uber's Rider, Driver, and Eats applications, equivalent to approximately 39,000 hours of manual testing effort, which would be impractical at this scale. We identified 23 resilience risks, 70% of which were architectural dependency violations in which non-critical service failures degraded core user flows. Twelve issues were severe enough to prevent trip requests or food orders. Two caused application crashes detectable only through mobile chaos testing, not through backend testing alone. Automated root cause analysis reduced debugging time from hours to minutes, achieving 88% precision@5 in attributing mobile failures to specific backend services. This paper presents the system design, evaluates its performance under fault injection (maintaining 99% test reliability), and reports operational experience demonstrating that continuous mobile resilience validation is achievable at production scale.
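The abstract does not define precision@5. One common reading in root cause analysis work is the fraction of failures for which the true culprit service appears among the top five ranked candidates. The sketch below computes the metric under that assumption; the function name, the service names, and the rankings are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def precision_at_k(
    ranked: Sequence[Sequence[str]],
    truth: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of failures whose true culprit is among the top-k candidates.

    Assumed definition (top-k hit rate); the paper may define it differently.
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one failure is required")
    hits = sum(
        1 for candidates, culprit in zip(ranked, truth) if culprit in candidates[:k]
    )
    return hits / len(truth)


# Hypothetical ranked suspect lists, one per failed mobile test.
rankings = [
    ["pricing", "maps", "promo", "auth", "eta", "payments"],
    ["auth", "promo", "maps", "eta", "pricing", "payments"],
    ["maps", "eta", "auth", "promo", "pricing", "payments"],
    ["promo", "auth", "maps", "pricing", "eta", "payments"],
]
# Hypothetical ground-truth culprit for each failure.
culprits = ["promo", "payments", "maps", "eta"]

score = precision_at_k(rankings, culprits, k=5)
print(f"precision@5 = {score:.2f}")
```

Under this reading, a score of 88% would mean that an engineer inspecting only the five highest-ranked services finds the responsible one in roughly nine out of ten failures.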
