Immersive computer graphics (CG) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; and second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user's perspective and construct a dataset of 3,500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on these descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.
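The abstract does not specify the internals of the two-stream retrieval framework, but the core idea of retrieval-augmented generation here can be sketched as follows: retrieve the quality descriptions of the most similar dataset images and prepend them to the VLM's prompt. The sketch below is a minimal, hypothetical illustration, assuming one stream scores visual-embedding similarity and the other scores text-embedding similarity, fused by a weighted sum; the function names, the fusion rule, and the weight `alpha` are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stream_retrieve(query_img_emb, query_txt_emb,
                        img_embs, txt_embs, descriptions,
                        k=2, alpha=0.5):
    """Rank dataset entries by a weighted sum of visual and textual
    similarity (a hypothetical fusion rule), and return the quality
    descriptions of the top-k matches."""
    scores = [alpha * cosine_sim(query_img_emb, i)
              + (1.0 - alpha) * cosine_sim(query_txt_emb, t)
              for i, t in zip(img_embs, txt_embs)]
    top = np.argsort(scores)[::-1][:k]
    return [descriptions[i] for i in top]

def build_prompt(retrieved, question):
    """Augment the VLM prompt with the retrieved descriptions."""
    context = "\n".join(f"- {d}" for d in retrieved)
    return (f"Reference descriptions of visually similar CG images:\n"
            f"{context}\n\nQuestion: {question}")
```

In practice the embeddings would come from a pretrained visual encoder and a text encoder over the 3,500-image dataset, and the retrieved descriptions would condition the VLM's fine-grained quality judgment.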