Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across \textbf{100 videos} totaling \textbf{41 hours} and \textbf{19,738 keyframes}, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.