Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 <canvas> Applications

The HyperText Markup Language 5 (HTML5) <canvas> is useful for creating visual-centric web applications. However, unlike traditional web applications, HTML5 <canvas> applications render objects onto the <canvas> bitmap without representing them in the Document Object Model (DOM). Mismatches between the expected and actual visual output of the <canvas> bitmap are termed visual bugs. Due to the visual-centric nature of <canvas> applications, visual bugs are important to detect because such bugs can render a <canvas> application useless. As we showed in prior work, Asset-Based graphics can provide the ground truth for a visual test oracle. However, many <canvas> applications procedurally generate their graphics. In this paper, we investigate how to detect visual bugs in <canvas> applications that use Procedural graphics as well. In particular, we explore the potential of Vision-Language Models (VLMs) to automatically detect visual bugs. Instead of defining an exact visual test oracle, information about the application's expected functionality (the context) can be provided with the screenshot as input to the VLM. To evaluate this approach, we constructed a dataset containing 80 bug-injected screenshots across four visual bug types (Layout, Rendering, Appearance, and State) plus 20 bug-free screenshots from 20 <canvas> applications. We ran experiments with a state-of-the-art VLM using several combinations of text and image context to describe each application's expected functionality. Our results show that by providing the application README(s), a description of visual bug types, and a bug-free screenshot as context, VLMs can be leveraged to detect visual bugs with up to 100% per-application accuracy.
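The evaluation setup described above can be illustrated with a minimal sketch. It assumes a hypothetical payload layout (`build_context`) and a hypothetical `Sample` record; neither is taken from the paper, the call to the VLM itself is deliberately left out, and the short list of bug-type names stands in for the fuller bug-type description the abstract mentions. The accuracy helper shows one plausible reading of "per-application accuracy": the fraction of correctly classified screenshots within each application.

```python
from collections import defaultdict
from dataclasses import dataclass

# Bug type names as listed in the abstract.
BUG_TYPES = ("Layout", "Rendering", "Appearance", "State")


@dataclass
class Sample:
    """Hypothetical record for one screenshot in the dataset."""
    app: str          # application name
    screenshot: str   # path to the screenshot under test
    has_bug: bool     # ground-truth label


def build_context(readme: str, bug_free_screenshot: str, screenshot: str) -> dict:
    """Assemble text and image context for one query (assumed layout)."""
    text = (
        "Application README:\n" + readme + "\n\n"
        "Visual bug types: " + ", ".join(BUG_TYPES) + ".\n"
        "The first image is a bug-free reference; the second is the "
        "screenshot under test. Does the second image contain a visual bug? "
        "Answer yes or no."
    )
    return {"text": text, "images": [bug_free_screenshot, screenshot]}


def per_application_accuracy(samples, predictions):
    """Return the fraction of correct predictions, grouped by application."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, predicted_bug in zip(samples, predictions):
        total[sample.app] += 1
        correct[sample.app] += int(predicted_bug == sample.has_bug)
    return {app: correct[app] / total[app] for app in total}
```

In such a setup, each screenshot would be paired with its application's README and bug-free reference via `build_context`, the model's yes/no answers would be collected as booleans, and `per_application_accuracy` would summarize them per application.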
