On-device control agents, particularly on mobile devices, operate the device to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, improving the user experience. However, fine-tuning MLLMs for on-device control is challenging due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to improve the efficiency of online RL fine-tuning for mobile device control agents. DistRL combines centralized training with decentralized data acquisition to enable efficient fine-tuning under dynamic online interactions. The framework is further backed by our tailored RL algorithm, which balances exploration with prioritized utilization of the collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and collects training data 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate over state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches under the same training time. These results validate DistRL as a scalable and efficient solution that substantially improves both training efficiency and agent performance for real-world, in-the-wild device control tasks.