Benchmarking Multi-Modal LLMs for Testing Visual Deep Learning Systems Through the Lens of Image Mutation

Visual deep learning (VDL) systems have shown significant success in real-world applications such as image recognition, object detection, and autonomous driving. A mainstream approach to evaluating the reliability of VDL systems is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary potential for image mutation through instruction-driven methods: users can now freely describe desired mutations and let MLLMs generate the mutated images. However, the quality of MLLM-produced test inputs in VDL testing remains largely unexplored. We present the first study that assesses MLLMs' adequacy in terms of 1) the semantic validity of MLLM-mutated images, 2) the alignment of MLLM-mutated images with their text instructions (prompts), 3) the faithfulness with which different mutations preserve semantics that ought to remain unchanged, and 4) the effectiveness in detecting VDL faults. Through large-scale human studies and quantitative evaluations, we find that MLLMs show promising potential for expanding the semantics covered by image mutations. Notably, while state-of-the-art MLLMs (e.g., GPT-4V) either fail to support or perform worse at editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-additive" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and we advocate that future VDL testing combine MLLM-based methods with traditional image mutations for comprehensive and reliable testing.
