Mobile GUI agents are becoming critical tools for improving user experience on smart devices, with multimodal large language models (MLLMs) emerging as the dominant paradigm in this domain. Current agents, however, rely on explicit human instructions and overlook the potential of contextual information (such as location, time, and user profile) and historical data for proactive task suggestion. Moreover, prior work focuses on optimizing the success rate of task execution but pays less attention to personalized execution trajectories, neglecting the potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip 20K benchmark. We collected 20K unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated; they are continuously collected from users' long-term, real-life device usage and include essential user-related contextual information. The benchmark comprises two new tracks: proactive task suggestion, which analyzes environment observations and users' previous intents, and personalized task execution, which caters to users' action preferences. Our experiments reveal that the proposed tracks pose significant challenges in leveraging user-related information for GUI tasks. A human study further shows a substantial gap between existing agents and human performance. A model fine-tuned on our collected data effectively exploits user information and achieves strong results, highlighting the potential of our approach for building more user-oriented mobile LLM agents. Our code is open-source at https://github.com/tsinghua-fib-lab/FingerTip-20K for reproducibility.
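To make the kind of data described above concrete, here is a minimal sketch of what a single context-annotated demonstration episode might look like. All field names and the overall schema are illustrative assumptions for exposition, not the released dataset's actual format.

```python
# Hypothetical sketch of a context-annotated demonstration episode.
# The schema below is an illustrative assumption, not the dataset's actual format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Context:
    """User-related context captured alongside a demonstration."""
    location: str            # e.g. "home", "office" (assumed encoding)
    timestamp: str           # ISO-8601 time of the interaction
    user_profile: str        # free-text or categorical user description

@dataclass
class Action:
    """One low-level GUI action in a multi-step trajectory."""
    kind: str                       # e.g. "tap", "swipe", "type"
    target: Optional[str] = None    # UI element identifier, if any
    text: Optional[str] = None      # typed text for "type" actions

@dataclass
class Episode:
    """A multi-step Android interaction demonstration."""
    user_id: str                    # links episodes from the same user's long-term usage
    app: str
    instruction: Optional[str]      # absent when the agent must proactively suggest the task
    context: Context
    actions: List[Action] = field(default_factory=list)
```

Under this sketch, the proactive-suggestion track would correspond to predicting a plausible `instruction` from `context` and the user's earlier episodes, while the personalized-execution track would correspond to producing an `actions` trajectory consistent with that user's demonstrated preferences.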