Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as "Multimodal Context Drift", which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.
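To make the mechanism concrete, the following is a minimal, hypothetical sketch of how an Image Understanding Tree could be held as explicit symbolic state across turns: entities carry attributes and relations, invariant properties are locked so incremental updates cannot drift them, and the tree is serialized into a constraint block for the generator. The class names (`EntityNode`, `ImageUnderstandingTree`) and methods (`lock`, `update`, `to_prompt`) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    """One entity in the Image Understanding Tree (IUT). Hypothetical structure."""
    name: str
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"color": "red"}
    locked: set[str] = field(default_factory=set)              # attribute keys frozen as invariant
    relations: dict[str, str] = field(default_factory=dict)    # e.g. {"holds": "umbrella"}

class ImageUnderstandingTree:
    """Explicit, persistent symbolic state shared across dialogue turns (sketch)."""
    def __init__(self) -> None:
        self.entities: dict[str, EntityNode] = {}

    def add_entity(self, name: str, **attributes: str) -> EntityNode:
        node = EntityNode(name=name, attributes=dict(attributes))
        self.entities[name] = node
        return node

    def lock(self, name: str, *attr_keys: str) -> None:
        """Mark attributes as invariant so later edits cannot drift them."""
        self.entities[name].locked.update(attr_keys)

    def update(self, name: str, **changes: str) -> None:
        """Incremental state update: apply changes only to unlocked attributes."""
        node = self.entities[name]
        for key, value in changes.items():
            if key in node.locked:
                # Invariant property: skip the change instead of overwriting it.
                continue
            node.attributes[key] = value

    def to_prompt(self) -> str:
        """Serialize the tree into a constraint block prepended to the generator prompt."""
        lines = []
        for node in self.entities.values():
            attrs = ", ".join(f"{k}={v}" for k, v in node.attributes.items())
            rels = ", ".join(f"{k}->{v}" for k, v in node.relations.items())
            lines.append(f"{node.name}: [{attrs}] {rels}".rstrip())
        return "\n".join(lines)

# Usage: entity identity and artistic style survive an edit that should only change the pose.
iut = ImageUnderstandingTree()
iut.add_entity("girl", hair="silver", style="watercolor", pose="standing")
iut.lock("girl", "hair", "style")                  # identity/style locked as invariant
iut.update("girl", pose="sitting", hair="black")   # hair change rejected, pose updated
print(iut.to_prompt())  # girl: [hair=silver, style=watercolor, pose=sitting]
```

In this reading, "topological constraints" correspond to regenerating the prompt from the serialized tree at every turn, so the generator is always conditioned on the full persistent state rather than on transient attention over prior turns.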