We present Minitap, a multi-agent system that achieves 100% success on the AndroidWorld benchmark. It is the first system to solve all 116 tasks, and it surpasses human performance (80%). We first analyze why single-agent architectures fail: context pollution from intermixed reasoning traces, silent text-input failures that the agent does not detect, and repetitive action loops with no escape mechanism. Minitap addresses each failure mode with a targeted mechanism: cognitive separation across six specialized agents, deterministic post-validation of text input against the device state, and meta-cognitive reasoning that detects action cycles and triggers a change of strategy. Ablations attribute +21 percentage points to multi-agent decomposition over single-agent baselines, +7 points to verified execution, and +9 points to meta-cognition. Minitap is released as open-source software at https://github.com/minitap-ai/mobile-use
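As a rough illustration of two of the mechanisms named above, the following sketch shows a deterministic post-check of typed text and a simple detector for repeated action cycles. The names `Action`, `verify_text_input`, and `detect_cycle`, the exact-match criterion, and the repetition thresholds are hypothetical assumptions made for illustration. They are not Minitap's actual implementation or API, and in a real system the observed text would be read back from a snapshot of the device UI.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical agent action; frozen so actions compare by value."""
    kind: str          # e.g. "tap", "type", "swipe"
    target: str        # hypothetical UI element identifier
    text: str = ""     # payload for "type" actions


def verify_text_input(expected: str, observed: Optional[str]) -> bool:
    """Deterministic post-validation (assumed exact match): compare the text the
    agent intended to type with the text read back from the field afterwards."""
    if observed is None:
        # Field not found in the device state: treat as a silent failure.
        return False
    return observed == expected


def detect_cycle(history: Sequence[Action], max_period: int = 3,
                 min_repeats: int = 3) -> Optional[int]:
    """Return the period p if the last p * min_repeats actions are the same
    block of p actions repeated, otherwise None. Thresholds are assumptions."""
    for p in range(1, max_period + 1):
        span = p * min_repeats
        if len(history) < span:
            break  # longer periods need even more history
        tail = history[-span:]
        block = tail[:p]
        if all(tail[i] == block[i % p] for i in range(span)):
            return p  # a caller would use this signal to switch strategy
    return None


if __name__ == "__main__":
    loop = [Action("tap", "search"), Action("swipe", "results")] * 3
    print(detect_cycle(loop))                  # 2: a two-step loop repeated three times
    print(verify_text_input("hello", "helo"))  # False: a dropped keystroke is caught
```

Checking against the device state rather than trusting the agent's own report is what makes the text-input check deterministic; the cycle detector only flags a loop, and deciding on the alternative strategy is left to the reasoning layer.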