Existing GUI agent models that rely on coordinate-based one-step visual grounding struggle to generalize across varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies but suffer from learning under severe data scarcity. To address these limitations, we propose ToolTok, a novel multi-step pathfinding paradigm for GUI agents, in which operations are modeled as sequences of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool with learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool in semantically related concepts as a natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token-definition question answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained with less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization to unseen scenarios. Our training and inference code is open-source at https://github.com/ZephinueCode/ToolTok.