Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning; they overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates at discontinuities in the cosine similarity between audio embeddings, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto the video tokens to establish an explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while pruning redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense show that DASH maintains higher accuracy than prior methods while achieving higher compression ratios. Code is available at https://github.com/laychou666/DASH.
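To make the pipeline described above concrete, the following is a minimal NumPy sketch of its three stages: audio-driven boundary detection, projection of boundaries onto video tokens, and tri-signal scoring with per-segment selection. It is not the released DASH implementation. The function names, the fixed similarity threshold `tau`, the proportional index rescaling, the equal fusion weights, and the uniform per-segment `keep_ratio` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from __future__ import annotations

import numpy as np

EPS = 1e-12  # guards against division by zero


def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalise each row."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + EPS)


def detect_boundaries(audio_emb: np.ndarray, tau: float = 0.5) -> list[int]:
    """Start indices of new segments on the audio axis.

    Assumed rule: a boundary precedes token i+1 when the cosine similarity
    between audio tokens i and i+1 drops below the threshold `tau`.
    """
    u = _unit(audio_emb)
    sim = np.sum(u[:-1] * u[1:], axis=1)  # sim[i] = cos(a_i, a_{i+1})
    return [int(i) + 1 for i in np.flatnonzero(sim < tau)]


def project_boundaries(boundaries: list[int], n_audio: int, n_video: int) -> list[int]:
    """Map audio boundaries to video token indices by proportional rescaling (assumption)."""
    scaled = {round(b * n_video / n_audio) for b in boundaries}
    return sorted(b for b in scaled if 0 < b < n_video)


def tri_signal_scores(video_emb, attn, boundaries, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse structural, distinctiveness, and attention signals into one score per token."""
    n = len(video_emb)
    edges = [0, *boundaries, n]

    # Structural cue: mark the two tokens on either side of each boundary.
    structural = np.zeros(n)
    for b in boundaries:
        structural[[b - 1, b]] = 1.0

    # Distinctiveness: 1 - cosine similarity to the segment centroid.
    u = _unit(video_emb)
    distinct = np.zeros(n)
    for s, e in zip(edges[:-1], edges[1:]):
        centroid = u[s:e].mean(axis=0)
        centroid = centroid / (np.linalg.norm(centroid) + EPS)
        distinct[s:e] = 1.0 - u[s:e] @ centroid

    # Attention salience: min-max normalised per-token attention (assumed 1-D input).
    salience = (attn - attn.min()) / (attn.max() - attn.min() + EPS)

    w_s, w_d, w_a = weights
    return w_s * structural + w_d * distinct + w_a * salience


def select_tokens(scores, boundaries, keep_ratio: float = 0.5) -> list[int]:
    """Keep the top-scoring tokens inside every segment (uniform ratio is an assumption)."""
    edges = [0, *boundaries, len(scores)]
    kept = []
    for s, e in zip(edges[:-1], edges[1:]):
        k = max(1, int(np.ceil(keep_ratio * (e - s))))
        top = np.argsort(scores[s:e])[::-1][:k] + s
        kept.extend(top.tolist())
    return sorted(kept)
```

A typical call chain would be `detect_boundaries` on the audio embeddings, `project_boundaries` to the video token count, then `tri_signal_scores` and `select_tokens` on the video tokens. Selecting within each segment rather than globally is what keeps at least one token per segment, so a short but semantically distinct segment is not pruned away entirely by tokens from longer, higher-attention regions.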