Beyond Syntax: Action Semantics Learning for App Agents

Recent advances in Large Language Models (LLMs) have enabled the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions built on proprietary LLM APIs show promising ability, they incur heavy compute costs and depend on external APIs. Fine-tuning smaller open-source LLMs addresses these limitations. However, current supervised fine-tuning methods follow a syntax learning paradigm that forces agents to reproduce the ground-truth action strings exactly, which leaves them vulnerable to out-of-distribution (OOD) inputs. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework whose objective is to capture the semantics of the ground-truth actions. Specifically, inspired by programming language theory, we define the action semantics for App agents as the state transition that the action induces in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity, which is used to train App agents to generate actions aligned with the semantics of the ground-truth actions even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate that ASL is more robust to the OOD problem than the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.
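To make the distinction between syntax and semantics concrete, the following is a minimal sketch, not the SEE module itself. It assumes a toy user interface in which every name (`Button`, `TOY_UI`, `transition`, `syntax_match`, `semantic_match`) and the `click(x, y)` string format are hypothetical and introduced only for illustration. It also simplifies the semantic similarity to a binary check of whether two actions induce the same state transition.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    """A clickable rectangle in the toy UI (hypothetical, for illustration)."""
    left: int
    top: int
    right: int
    bottom: int
    target_screen: str

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Toy UI state: each screen maps to the buttons it shows (an assumption).
TOY_UI = {
    "home": [
        Button(0, 0, 100, 50, "settings"),
        Button(0, 60, 100, 110, "search"),
    ],
}


def transition(screen: str, action: str) -> str:
    """Return the screen reached by applying a 'click(x, y)' action string."""
    match = re.fullmatch(r"click\((\d+),\s*(\d+)\)", action)
    if match is None:
        return screen  # unparseable actions leave the state unchanged
    x, y = int(match.group(1)), int(match.group(2))
    for button in TOY_UI.get(screen, []):
        if button.contains(x, y):
            return button.target_screen
    return screen  # clicking empty space changes nothing


def syntax_match(predicted: str, truth: str) -> bool:
    """Syntax view: the action strings must be identical."""
    return predicted == truth


def semantic_match(screen: str, predicted: str, truth: str) -> bool:
    """Semantic view: the actions must induce the same state transition."""
    return transition(screen, predicted) == transition(screen, truth)


truth = "click(50, 25)"
same_button = "click(40, 30)"   # different string, same button
other_button = "click(50, 80)"  # different string, different button

print(syntax_match(same_button, truth))             # False
print(semantic_match("home", same_button, truth))   # True
print(semantic_match("home", other_button, truth))  # False
```

Under these assumptions, a click anywhere inside the same button counts as correct in the semantic view, whereas the syntax view rejects every string other than the exact ground truth.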
