PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu

Source: arXiv. Project page: https://phoneharness.github.io/

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
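
To make the idea of deterministic action routing with bounded GUI delegation and an auditable trace more concrete, the following is a minimal illustrative sketch. It is not the PhoneHarness implementation: the names `Action`, `Harness`, `gui_budget`, and the handler callables are all hypothetical, and the handlers are stand-ins for whatever actually drives the GUI, the device shell, or a host-side tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """One agent action. All names here are hypothetical, for illustration only."""
    surface: str  # "gui", "cli", or "tool"
    payload: str  # e.g. a tap description, a shell command, or a tool call


@dataclass
class Harness:
    """Toy mixed-action loop: route by surface, cap GUI use, log every step."""
    handlers: Dict[str, Callable[[str], str]]
    gui_budget: int = 3  # assumed bound on delegated GUI steps
    trace: List[dict] = field(default_factory=list)
    gui_used: int = 0

    def step(self, action: Action) -> str:
        # Deterministic routing: the surface name alone selects the handler.
        if action.surface not in self.handlers:
            result = "error: unknown surface"
        elif action.surface == "gui" and self.gui_used >= self.gui_budget:
            # Bounded delegation: refuse GUI steps once the budget is spent.
            result = "error: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_used += 1
            result = self.handlers[action.surface](action.payload)
        # Auditable trace: every attempt is recorded, including refusals.
        self.trace.append({
            "step": len(self.trace),
            "surface": action.surface,
            "payload": action.payload,
            "result": result,
        })
        return result


if __name__ == "__main__":
    # Stub handlers that only echo their input; a real system would act on a device.
    harness = Harness(
        handlers={
            "gui": lambda p: "gui:" + p,
            "cli": lambda p: "cli:" + p,
            "tool": lambda p: "tool:" + p,
        },
        gui_budget=1,
    )
    for act in [Action("cli", "list files"), Action("gui", "tap send"), Action("gui", "tap send")]:
        harness.step(act)
    for entry in harness.trace:
        print(entry)
```

In this sketch, a benchmark-style check would inspect `harness.trace` (or the device state the handlers changed) rather than the agent's final answer, which mirrors the abstract's emphasis on observable side effects.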
