PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Computer-using agents drive real software through the screen, clicking and typing, but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time the agent succeeds, PreAct compiles the run into a small state-machine program (states that check the screen, transitions that act), and on later runs it replays that program directly instead of invoking the agent, running 8.5-13x faster with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task. This catches programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate; it is worth 1.75-2.6 tasks per benchmark, in the same direction on all three. A fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
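To make the two checks concrete, the following is a minimal sketch of guarded replay and store-time admission, not PreAct's actual implementation. Every name in it (`Screen`, `Step`, `Program`, `replay`, `admit`) and the dictionary-based screen model are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical screen model: widget id -> visible text. A real system
# would inspect screenshots or an accessibility tree instead.
Screen = Dict[str, str]


@dataclass
class Step:
    expects: Callable[[Screen], bool]  # state: does the screen match the recording?
    act: Callable[[Screen], None]      # transition: the recorded click or keystroke


@dataclass
class Program:
    task: str
    steps: List[Step]


def replay(program: Program, screen: Screen) -> Optional[int]:
    """Replay without model calls. Return None if every step ran, else the
    index of the first step whose screen check failed (the hand-back point)."""
    for i, step in enumerate(program.steps):
        if not step.expects(screen):
            return i
        step.act(screen)
    return None


def admit(program: Program, fresh_screen: Callable[[], Screen],
          evaluator: Callable[[Screen], bool], store: Dict[str, Program]) -> bool:
    """Store-time check: keep the program only if, re-run from a clean state,
    it replays to the end AND an independent evaluator confirms success."""
    screen = fresh_screen()
    if replay(program, screen) is None and evaluator(screen):
        store[program.task] = program
        return True
    return False


# Toy task (assumed for illustration): turn Wi-Fi on from the home page.
def open_settings(s: Screen) -> None:
    s["page"] = "settings"


def toggle_wifi(s: Screen) -> None:
    s["wifi"] = "on"


good = Program("enable wifi", [
    Step(lambda s: s.get("page") == "home", open_settings),
    Step(lambda s: s.get("page") == "settings", toggle_wifi),
])
# Replays to its last step yet leaves the task undone: the toggle is never pressed.
bad = Program("enable wifi", [
    Step(lambda s: s.get("page") == "home", open_settings),
])

store: Dict[str, Program] = {}
fresh = lambda: {"page": "home", "wifi": "off"}
solved = lambda s: s.get("wifi") == "on"

admitted_bad = admit(bad, fresh, solved, store)
admitted_good = admit(good, fresh, solved, store)
drift_index = replay(good, {"page": "login"})  # unexpected screen at step 0
```

In this toy setup the `bad` program finishes replay but fails the evaluator, so it never enters the store, while a replay of `good` on an unexpected screen stops before acting and reports where the agent would take over.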
