Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. Effectively comprehending user intentions, through their actions and interactions with GUIs, positions agents to better achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, which aims to infer the user's intended task from their GUI interactions. We propose a novel evaluation metric that assesses whether two task descriptions are paraphrases within a specific UI environment. Leveraging the inverse relation between goal identification and the UI automation task, we adapt the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conduct several experiments comparing the performance of humans against state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini outperforms GPT but still falls short of human performance, indicating significant room for improvement.