LLM-based mobile GUI agents treat every task invocation as an independent reasoning episode, requiring a full LLM inference call at each action step. This per-step dependence makes them stateless: a task completed successfully yesterday is re-derived from scratch today, with no improvement in reliability or speed. We present SkillDroid, a three-layer skill agent that compiles successful LLM-guided GUI trajectories into parameterized skill templates (sequences of UI actions with weighted element locators and typed parameter slots) and replays them on future invocations without any LLM calls. A matching cascade (regex patterns, embedding similarity, and app filtering) routes incoming instructions to stored skills, while a failure-learning layer triggers recompilation when skill reliability degrades. Over a 150-round longitudinal evaluation with systematic instruction variation and controlled perturbations, SkillDroid achieves an 85.3% success rate (23 percentage points above a stateless LLM baseline) while using 49% fewer LLM calls. The skill replay mechanism achieves a 100% success rate across 79 replay rounds at 2.4 times the speed of full LLM execution. Most critically, the system improves with use: its success rate rises from 87% to 91% over the course of the evaluation, while the baseline degrades from 80% to 44%.
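To make the idea of a parameterized skill template concrete, the following is a minimal sketch of the regex stage of a matching cascade combined with an app filter. It is an illustration under stated assumptions, not SkillDroid's implementation: the names `Skill`, `match_skill`, and `replay`, the string-based action format, and the example pattern are all hypothetical, and the embedding-similarity stage and weighted element locators are omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill template; all field names are assumptions."""
    name: str
    app: str
    pattern: str        # regex whose named groups act as parameter slots
    actions: list[str]  # action templates containing {slot} placeholders


def match_skill(instruction, app, skills):
    """Return (skill, params) for the first skill whose app and regex match.

    Returns None when nothing matches; a caller would then fall back to
    another cascade stage or to full LLM execution (not shown here).
    """
    for skill in skills:
        if skill.app != app:  # app filter
            continue
        m = re.fullmatch(skill.pattern, instruction.strip(), flags=re.IGNORECASE)
        if m:
            return skill, m.groupdict()
    return None


def replay(skill, params):
    """Fill the parameter slots to get concrete actions, with no LLM call."""
    return [action.format(**params) for action in skill.actions]


# Usage with an invented alarm skill.
skills = [
    Skill(
        name="set_alarm",
        app="clock",
        pattern=r"set an alarm for (?P<time>\d{1,2}:\d{2})",
        actions=["tap:alarm_tab", "tap:add_alarm", "type:{time}", "tap:save"],
    )
]
matched = match_skill("Set an alarm for 7:30", "clock", skills)
if matched is not None:
    skill, params = matched
    print(replay(skill, params))
```

In this sketch an instruction that matches no stored pattern simply returns `None`, which stands in for the point where an agent would defer to LLM-guided execution.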