Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator, dubbed \textbf{P}ositional \textbf{P}reservation \textbf{E}mbedding (\textbf{PPE}), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces a disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate the distinct positions of multiple original tokens. Furthermore, we show that PPE effectively supports cascade clustering -- a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free, generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of $2\%\sim5\%$ across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding), and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning. Our code is available at https://github.com/MouxiaoHuang/PPE.
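To illustrate the idea at a high level, the sketch below shows one possible interpretation of a parameter-free, disentangled 3D positional encoding combined with token merging. All function names (`sincos_1d`, `ppe_encode`, `merge_with_ppe`), the choice of sinusoidal encodings, and the averaging scheme are illustrative assumptions, not the paper's actual PPE operator; the channel dimension is simply split into three groups, one per axis (t, h, w), so that each merged token carries positional information from all of its source tokens.

```python
import numpy as np

def sincos_1d(pos, dim):
    # Standard parameter-free sinusoidal encoding of a scalar position
    # (illustrative; not necessarily the encoding used by PPE).
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    enc = np.zeros(dim)
    enc[0::2] = np.sin(pos * freqs)
    enc[1::2] = np.cos(pos * freqs)
    return enc

def ppe_encode(position, dim):
    # Disentangled 3D encoding: split the channel dimension into three
    # groups, one each for the temporal (t) and spatial (h, w) axes.
    d = dim // 3
    t, h, w = position
    return np.concatenate([
        sincos_1d(t, d),
        sincos_1d(h, d),
        sincos_1d(w, dim - 2 * d),  # remainder channels go to the w axis
    ])

def merge_with_ppe(tokens, positions, dim):
    # Merge a cluster of visual tokens into one compressed token:
    # average the features, then inject the (averaged) positional
    # encodings of *all* source tokens so their spatiotemporal
    # locations are not discarded by the merge.
    merged = np.mean(tokens, axis=0)
    pos_enc = np.mean([ppe_encode(p, dim) for p in positions], axis=0)
    return merged + pos_enc
```

Because the encoding is sinusoidal, the operator introduces no learnable parameters and could in principle be dropped into an existing merging pipeline; how source positions are actually aggregated per compressed token is defined in the paper itself.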