Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in capturing the parametric physics underlying image degradations. In this work, we reformulate degradation understanding as a hierarchical structured prediction task that requires jointly estimating degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under a single autoregressive next-token prediction paradigm, with estimation error bounded by the resolution of the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning with structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce \textbf{DU-110k}, a large-scale dataset of 110,000 clean-degraded image pairs, each with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, and generalizes well to unseen degradation distributions.
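To make the quantization bound concrete, consider a minimal sketch under a simplifying assumption of a uniform grid (the actual value tokenization may differ): if a continuous value $v \in [v_{\min}, v_{\max}]$ is encoded as the nearest of $K$ grid tokens with spacing $\Delta = (v_{\max} - v_{\min})/(K - 1)$, then a correctly decoded token $\hat{v}$ satisfies
\[
|\hat{v} - v| \;\le\; \frac{\Delta}{2} \;=\; \frac{v_{\max} - v_{\min}}{2(K - 1)},
\]
so the value-space error of the next-token formulation shrinks linearly as the grid is refined, i.e., as more value tokens are allotted to the same physical range.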