Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device--cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but computationally expensive. State-of-the-art LLM device--cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device--cloud LLM agent coordinator for long-horizon tasks that attains a strong performance--cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for a cold start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, and each state is labeled according to whether the device action agrees with the cloud action. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels that favor higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate while calling the cloud model in only 46.3% of steps.
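The two labeling steps can be illustrated with a minimal sketch. It is not Hera's implementation: the abstract does not specify the data structures or the scoring rule, so the names `Step`, `Visit`, `imitation_labels`, `cost_aware_labels`, and the `cloud_penalty` weight are all hypothetical, and the linear penalty on future cloud calls is only one plausible way to favor higher return and fewer cloud calls.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable

DEVICE, CLOUD = 0, 1  # routing labels (hypothetical encoding)


@dataclass(frozen=True)
class Step:
    """One step of a cloud trajectory (hypothetical structure)."""
    state: Hashable
    cloud_action: str


@dataclass(frozen=True)
class Visit:
    """One visit to a state during a rollout (hypothetical structure)."""
    state: Hashable
    route: int               # DEVICE or CLOUD, the decision taken at this state
    task_return: float       # return obtained by the rollout from this state
    future_cloud_calls: int  # cloud calls made after this decision


def imitation_labels(cloud_trajectory: Iterable[Step],
                     device_policy: Callable[[Hashable], str]):
    """Stage 1 sketch: replay the device agent on cloud states and label
    a state DEVICE when its action agrees with the cloud action, else CLOUD."""
    return [
        (step.state,
         DEVICE if device_policy(step.state) == step.cloud_action else CLOUD)
        for step in cloud_trajectory
    ]


def cost_aware_labels(visits: Iterable[Visit], cloud_penalty: float = 0.05):
    """Stage 2 sketch: group identical states across rollouts and pick the
    route with the best mean score. The score (return minus a linear penalty
    on future cloud calls) is an assumption, not the paper's objective."""
    scores = defaultdict(lambda: defaultdict(list))
    for v in visits:
        score = v.task_return - cloud_penalty * v.future_cloud_calls
        scores[v.state][v.route].append(score)
    return {
        state: max(by_route, key=lambda r: sum(by_route[r]) / len(by_route[r]))
        for state, by_route in scores.items()
    }
```

Under these assumptions, the stage-one labels would serve as targets for a supervised router classifier, and the stage-two labels would replace them wherever rollouts show that a different route yields a better return for its cloud cost.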