Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive, ineffective attempts and fail to achieve stable improvements in answer quality. To address this issue, we propose OCR-Agent, a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. The framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts, avoiding repetition and exploring new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
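The diagnose-review-re-reason loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`self_correct`, `generate`, `reflect`), the callable interface, and the turn budget are all assumptions introduced for exposition.

```python
def self_correct(generate, reflect, max_turns=3):
    """Illustrative sketch of an iterative self-correction loop.

    generate(plan, memory) -> candidate answer, given a correction plan
                              and the record of past attempts
    reflect(answer)        -> (is_ok, correction_plan)
                              (stands in for Capability Reflection:
                               diagnose errors, produce a plan)
    """
    memory = []                 # past attempts, consulted by Memory Reflection
    plan = None
    answer = generate(plan, memory)
    for _ in range(max_turns):
        ok, plan = reflect(answer)   # Capability Reflection step
        if ok:
            break
        memory.append(answer)        # Memory Reflection: record the failed
                                     # attempt so it is not repeated
        answer = generate(plan, memory)  # re-reason under the new plan
    return answer


if __name__ == "__main__":
    # Toy mock model: it proposes "4" twice, then "6"; memory reflection
    # skips the repeated wrong attempt.
    attempts = iter(["4", "4", "6"])

    def generate(plan, memory):
        for a in attempts:
            if a not in memory:      # avoid answers already tried
                return a
        return memory[-1]

    def reflect(answer):
        return (answer == "6", "recount the objects")

    print(self_correct(generate, reflect))  # prints "6"
```

The point of the sketch is the separation of concerns: error diagnosis (`reflect`) produces a plan, the attempt history (`memory`) constrains regeneration, and the loop terminates once reflection accepts an answer or the turn budget is spent.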