Mobile GUI agents can automate smartphone tasks by interacting directly with app interfaces, but how they should communicate with users during execution remains underexplored. Existing systems rely on two extremes: foreground execution, which maximizes transparency but prevents multitasking, and background execution, which supports multitasking but provides little visual awareness. Through iterative formative studies, we found that users prefer a hybrid model with just-in-time visual interaction, but that the most effective visualization modality depends on the task. Motivated by this, we present AgentLens, a mobile GUI agent that adaptively uses three visual modalities during human-agent interaction: Full UI, Partial UI, and GenUI. AgentLens extends a standard mobile agent with adaptive communication actions and uses Virtual Display to enable background execution with selective visual overlays. In a controlled study with 21 participants, AgentLens was preferred by 85.7% of participants and achieved the best usability (Overall PSSUQ score of 1.94, where lower is better) and the highest adoption intent (6.43/7).