C^2RoPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in a loss of spatial locality. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, which leads to long-term decay in attention allocation and causes the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
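
To make the triplet index and the Chebyshev distance concrete, the following is a minimal sketch and not the released implementation. It assumes a row-major image-token grid with a top-left origin and integer (x, y) coordinates. The function names `triplet_indices`, `chebyshev`, and `chebyshev_mask` are hypothetical, and the masking rule shown (a token may attend to tokens that are no farther from a reference point than itself) is only one possible reading of "determining causal dependencies by Chebyshev distance". The frequency allocation step is not covered.

```python
from typing import List, Optional, Tuple


def triplet_indices(height: int, width: int, t_start: int = 0) -> List[Tuple[int, int, int]]:
    """Return one (t, x, y) index per image token in row-major order.

    t is the 1D temporal (sequence) position; x and y are Cartesian grid
    coordinates. The top-left origin is an assumption of this sketch.
    """
    return [
        (t_start + row * width + col, col, row)
        for row in range(height)
        for col in range(width)
    ]


def chebyshev(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Chebyshev (L-infinity) distance between two 2D points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def chebyshev_mask(
    height: int,
    width: int,
    ref: Optional[Tuple[float, float]] = None,
) -> List[List[bool]]:
    """Hypothetical mask: mask[i][j] is True when token i may attend to token j.

    Assumed rule (not confirmed by the abstract): token j is visible to
    token i when j is no farther from the reference point than i is.
    The reference point defaults to the grid center.
    """
    if ref is None:
        ref = ((width - 1) / 2, (height - 1) / 2)
    coords = [(float(x), float(y)) for _, x, y in triplet_indices(height, width)]
    dist = [chebyshev(c, ref) for c in coords]
    n = len(coords)
    return [[dist[j] <= dist[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    idx = triplet_indices(2, 3, t_start=10)
    print(idx)
    # Vertically adjacent tokens: far apart in t, close in 2D space.
    (t_a, x_a, y_a), (t_b, x_b, y_b) = idx[0], idx[3]
    print(t_b - t_a, chebyshev((x_a, y_a), (x_b, y_b)))
```

In the 2×3 grid above, the tokens at (0, 0) and (0, 1) are vertically adjacent, so their Chebyshev distance is 1, yet their 1D temporal indices differ by the row width, 3. This gap between sequence distance and spatial distance is the locality loss along the column dimension that the abstract attributes to purely temporal indexing.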
