Graphical User Interface (GUI) agents are pivotal to advancing intelligent human-computer interaction. Building powerful GUI agents requires large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation and current GUI agent data-mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M$^2$-Miner, the first low-cost, automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). To improve mining efficiency and data quality, we present a collaborative multi-agent framework comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further improve mining efficiency and enrich intent diversity, we design an intent recycling strategy that extracts additional valuable interaction trajectories. In addition, we introduce a progressive model-in-the-loop training strategy to raise the success rate of data mining. Extensive experiments demonstrate that a GUI agent fine-tuned on our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate community research.