Existing vision-language model (VLM)-based methods for AI-generated image quality assessment (AIGIQA) suffer from a fundamental conflict between the semantic and distortion dimensions: a monolithic representation optimized for semantic discrimination entangles compositional understanding with low-level perceptual sensitivity, which leaves it blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. The architecture uses two CLIP encoders with complementary patch granularities: the coarse-grained stream captures global semantic coherence, while the fine-grained stream preserves textural signatures and artifact patterns. A gated fusion mechanism inspired by the information bottleneck performs adaptive cross-scale distillation, and an optional cross-attention module enables prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments on five benchmarks establish new state-of-the-art results, with average SRCC improvements of 1.11% on quality prediction and 2.35% on text-image correspondence prediction, while remaining efficient with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.
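To make the gated fusion of the two streams concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of the gate, so the sketch assumes a common choice: an elementwise sigmoid gate computed from the concatenated coarse and fine features. The names `gated_fusion`, `weights`, and `bias` are hypothetical, the information-bottleneck regularization is not modeled, and plain Python lists stand in for the CLIP feature tensors a real model would use.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(coarse, fine, weights, bias):
    """Fuse two equal-length feature vectors with an elementwise gate.

    Assumed form (not taken from the paper):
        gate  = sigmoid(W @ [coarse; fine] + b)
        fused = gate * coarse + (1 - gate) * fine

    coarse, fine: lists of d floats (one feature vector per stream)
    weights:      d rows of 2*d floats (hypothetical learned projection)
    bias:         list of d floats
    """
    if len(coarse) != len(fine):
        raise ValueError("both streams must have the same feature dimension")
    joint = coarse + fine  # list concatenation: [coarse; fine]
    fused = []
    for i, (c, f) in enumerate(zip(coarse, fine)):
        logit = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        gate = sigmoid(logit)
        # A gate near 1 keeps the coarse (semantic) feature,
        # a gate near 0 keeps the fine (textural) feature.
        fused.append(gate * c + (1.0 - gate) * f)
    return fused


# Toy features standing in for the two CLIP streams.
coarse_feat = [1.0, 2.0]
fine_feat = [3.0, 4.0]
zero_weights = [[0.0] * 4 for _ in range(2)]

# With zero weights and zero bias every gate is 0.5, so the result is the mean.
balanced = gated_fusion(coarse_feat, fine_feat, zero_weights, [0.0, 0.0])

# A large positive bias saturates the gate and selects the coarse stream.
coarse_only = gated_fusion(coarse_feat, fine_feat, zero_weights, [50.0, 50.0])
```

In this toy setting the gate is a fixed function of the bias. In a trained model the gate would depend on the image content, so each feature dimension could lean on the semantic or the textural stream as needed.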