DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

On-device control agents, especially those on mobile devices, operate the device to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online reinforcement learning (RL) fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and collects training data 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate over state-of-the-art methods on general Android tasks from an open benchmark, within the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
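The combination of centralized training and decentralized, asynchronous data acquisition described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not DistRL's implementation: the names `worker`, `Learner`, and `run` are hypothetical, threads stand in for separate machines, the trajectories are placeholder records rather than real device interactions, and the "update" merely increments a version counter instead of running the paper's RL algorithm.

```python
import queue
import threading


def worker(worker_id, num_episodes, out_queue, get_policy_version):
    """Hypothetical data-collection worker: produces trajectories independently."""
    for episode in range(num_episodes):
        # A real worker would drive a device or emulator here (assumption);
        # this sketch only emits a placeholder record.
        trajectory = {
            "worker": worker_id,
            "episode": episode,
            # The version may already be stale when the learner sees it;
            # that staleness is what makes the collection asynchronous.
            "policy_version": get_policy_version(),
        }
        out_queue.put(trajectory)


class Learner:
    """Hypothetical central learner: consumes trajectories and bumps the policy version."""

    def __init__(self):
        self.policy_version = 0
        self.consumed = 0
        self._lock = threading.Lock()

    def get_policy_version(self):
        with self._lock:
            return self.policy_version

    def train(self, in_queue, total, batch_size):
        batch = []
        while self.consumed < total:
            batch.append(in_queue.get())  # blocks until some worker delivers data
            self.consumed += 1
            if len(batch) == batch_size or self.consumed == total:
                # Placeholder for a gradient step on the collected batch.
                with self._lock:
                    self.policy_version += 1
                batch = []


def run(num_workers=4, episodes_per_worker=5, batch_size=4):
    """Start the workers, train centrally, and return (consumed, policy_version)."""
    shared_queue = queue.Queue()
    learner = Learner()
    threads = [
        threading.Thread(
            target=worker,
            args=(i, episodes_per_worker, shared_queue, learner.get_policy_version),
        )
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    learner.train(shared_queue, num_workers * episodes_per_worker, batch_size)
    for t in threads:
        t.join()
    return learner.consumed, learner.policy_version
```

Calling `run()` with the defaults consumes 20 trajectories and performs 5 placeholder updates. The point of the design is that no worker waits for the others or for the learner: each one pushes data as soon as it is ready, and the learner updates whenever a batch fills.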
