Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks requiring multi-step reasoning and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem and propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM poses a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating vision-guided parallel expansion and a lightweight value network, TDSR reduces the frequency of calls to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early-stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that TDSR, as a plug-and-play module, significantly enhances the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL), achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
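To make the search procedure described above concrete, the following is a minimal, illustrative sketch of an MCTS loop with the three ingredients the abstract names: batched ("parallel") expansion so one expensive generator call yields several children at once, a cheap value function in place of full rollouts, and early stopping when the best value plateaus. All names here (`propose_actions`, `value_net`, the toy integer-token states) are hypothetical stand-ins, not the paper's actual implementation or any VLM API.

```python
import math
import random


class Node:
    """A search-tree node holding a partial caption state (toy token list)."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # Standard UCB1 score; unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def propose_actions(state, k=3):
    # Stand-in for one batched VLM call that proposes k refinement
    # candidates at once (the "parallel expansion" idea).
    return [state + [t] for t in random.sample(range(10), k)]


def value_net(state):
    # Stand-in for the lightweight value network that scores a partial
    # caption without an expensive VLM rollout.
    return sum(state) / (10.0 * max(len(state), 1))


def mcts(root_state, iters=50, patience=10, max_depth=6):
    root = Node(list(root_state))
    best, stale = -1.0, 0
    for _ in range(iters):
        # Selection: descend via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion: one batched proposal call creates several children.
        if node.visits > 0 and len(node.state) < max_depth:
            node.children = [Node(s, node) for s in propose_actions(node.state)]
            node = node.children[0]
        # Evaluation by the cheap value function instead of a rollout.
        v = value_net(node.state)
        # Backpropagation up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += v
            node = node.parent
        # Adaptive early stopping: halt once the best value plateaus.
        if v > best:
            best, stale = v, 0
        else:
            stale += 1
            if stale >= patience:
                break
    if root.children:
        return max(root.children, key=lambda n: n.visits).state
    return root.state
```

Because the generator is queried once per expansion rather than once per simulated step, the number of expensive calls scales with the number of expanded nodes, not with rollout length, which is the lever the abstract credits for the order-of-magnitude reduction.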