Modern mobile applications rely on hidden interactions, gestures without visual cues such as long presses and swipes, to provide functionality without cluttering their interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents, systems powered by vision language models (VLMs) that automate tasks on mobile user interfaces, struggle to detect hidden interactions or to determine the actions needed to complete a task. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations show that VLMs fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
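To make the dataset description concrete, the minimal sketch below illustrates what a single GhostUI record could look like, given the fields named in the abstract. The class name, field names, and gesture vocabulary are illustrative assumptions for this sketch, not the released schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical sketch of one GhostUI example; names such as
# `before_screenshot` and `gesture_type` are assumptions, not the actual schema.
@dataclass
class GhostUIExample:
    before_screenshot: str            # path to the screen image before the gesture
    after_screenshot: str             # path to the screen image after the gesture
    view_hierarchy: dict              # simplified view hierarchy of the "before" screen
    gesture_type: Literal["long_press", "swipe", "double_tap"]  # kind of hidden gesture
    gesture_target: tuple[int, int]   # (x, y) coordinate the gesture acts on
    task_description: str             # natural-language task the gesture helps complete

# Example usage with made-up values, only to show how the fields fit together.
example = GhostUIExample(
    before_screenshot="screens/mail_inbox_before.png",
    after_screenshot="screens/mail_inbox_after.png",
    view_hierarchy={"class": "RecyclerView", "children": []},
    gesture_type="swipe",
    gesture_target=(540, 880),
    task_description="Archive the newest email in the inbox.",
)
```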