Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, and that the extracted action sequences are replayable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.
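To make the distinction between the two approaches concrete, the following is a minimal sketch, not the paper's implementation: DF samples frames from the recording and passes them to a VLM, while DiffF additionally supplies explicit frame-difference regions computed with OpenCV. The `query_vlm` function, the sampling interval, and the difference threshold are hypothetical placeholders.

```python
import cv2
import numpy as np


def sample_frames(video_path: str, every_n: int = 30) -> list[np.ndarray]:
    """Sample every n-th frame from a desktop recording."""
    frames, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def frame_diff_regions(prev: np.ndarray, curr: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return bounding boxes of UI regions that changed between two frames (used by DiffF)."""
    gray_prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    gray_curr = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray_prev, gray_curr)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]


def query_vlm(frames: list[np.ndarray], hint: str = "") -> str:
    """Hypothetical placeholder: ask a VLM to describe the user action shown in these frames."""
    raise NotImplementedError("Replace with a call to your VLM provider's API.")


def extract_actions(video_path: str, use_diff: bool = False) -> list[str]:
    """DF when use_diff=False; DiffF when use_diff=True (adds explicit change regions as a hint)."""
    frames = sample_frames(video_path)
    actions = []
    for prev, curr in zip(frames, frames[1:]):
        hint = ""
        if use_diff:
            hint = f"Changed regions (x, y, w, h): {frame_diff_regions(prev, curr)}"
        actions.append(query_vlm([prev, curr], hint))
    return actions
```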