We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that a detailed examination of local patches can reduce the risk of errors and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
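To make the local-to-global merging idea concrete, the following is a minimal sketch under stated assumptions, not the implementation evaluated in this work. The names `split_into_patches`, `caption_pyramid`, `captioner`, and `merger` are hypothetical; the 2x2 grid, the recursion depth, and the toy list-of-rows image type are illustrative choices. In practice `captioner` would wrap an image captioning model and `merger` would wrap a large language model prompt that fuses the global caption with the local ones.

```python
from typing import Callable, List

# Toy image type (assumption): a list of rows of pixel values.
Image = List[List[int]]


def split_into_patches(image: Image, rows: int = 2, cols: int = 2) -> List[Image]:
    """Split an image into a rows x cols grid of patches, in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            patches.append([row[left:right] for row in image[top:bottom]])
    return patches


def caption_pyramid(
    image: Image,
    captioner: Callable[[Image], str],
    merger: Callable[[str, List[str]], str],
    depth: int = 1,
) -> str:
    """Caption the whole image, caption its patches, then merge the results.

    `captioner` and `merger` are hypothetical stand-ins for a captioning
    model and an LLM-based fusion step. `depth` controls how many pyramid
    levels are built below the global caption (an illustrative assumption).
    """
    global_caption = captioner(image)
    # Stop when no further levels are requested or the image cannot be split.
    if depth == 0 or len(image) < 2 or len(image[0]) < 2:
        return global_caption
    local_captions = [
        caption_pyramid(patch, captioner, merger, depth - 1)
        for patch in split_into_patches(image)
    ]
    return merger(global_caption, local_captions)
```

Passing the two models in as callables keeps the sketch independent of any particular captioning or language model, so different combinations can be swapped in without changing the pyramid logic.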