To act reliably, AI agents operating on user interfaces must understand how those interfaces communicate state and feedback. Animations are a core communicative modality and are increasingly used in modern interfaces, where they serve critical functional purposes beyond mere aesthetics. Understanding UI animation is therefore essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.