Audio-Visual Large Language Models (AV-LLMs) incur prohibitive computational costs from processing massive, redundant audio-visual token streams. Existing unimodal compression techniques fail to capture the heterogeneous and mutually influential information density of joint audio-visual signals. Furthermore, we identify a fundamental and overlooked theoretical bottleneck in sparse token reduction: positional aliasing. We demonstrate that aggressive sparse sampling on standard position-encoded sequences violates the Nyquist limit relative to the effective token interval, causing phase-wrapping collisions that corrupt temporal monotonicity. To address this, we introduce EchoingPixels, a framework for aliasing-resistant joint token reduction. Our Cross-Modal Semantic Sieve performs extractive selection on the synergistic audio-visual stream, dynamically allocating budgets based on joint-modality saliency rather than fixed per-modality ratios. To resolve positional aliasing, we derive Sync-RoPE, a spectral low-pass filter for Rotary Positional Embeddings that adapts the encoding bandwidth to the sparse sampling rate, preserving monotonic temporal relationships in the reduced stream. Experiments show that EchoingPixels achieves performance comparable to full models using only 5-20% of the original tokens, validating theoretically grounded sparse learning as a robust solution for efficient AV-LLMs. Code is available at https://github.com/CharlesGong12/EchoingPixels.
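The aliasing condition described above can be illustrated with a small sketch. This is not the Sync-RoPE derivation and is not taken from the released code. It assumes the common RoPE frequency schedule (base 10000), a hypothetical head dimension of 64, and uniform sparse sampling, so that keeping 20% or 5% of tokens corresponds to a stride of roughly 5 or 20 original positions between retained tokens. Under those assumptions, a rotary frequency aliases once its phase advance per retained token reaches pi. The function names are illustrative only.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Angular frequency of each rotated dimension pair (common RoPE schedule, assumed)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def aliased_mask(freqs, stride):
    """True where the phase step between retained tokens reaches or exceeds pi.

    `stride` is the gap, in original positions, between consecutive retained
    tokens; uniform sampling is an assumption of this sketch.
    """
    return [stride * f >= math.pi for f in freqs]

freqs = rope_frequencies(head_dim=64)  # hypothetical head dimension
for stride in (1, 5, 20):
    n_aliased = sum(aliased_mask(freqs, stride))
    print(f"stride={stride}: {n_aliased} of {len(freqs)} frequencies alias")
```

With a stride of 1 no frequency violates the condition, while larger strides push the highest-frequency pairs past it. A bandwidth-adaptive encoding, as the abstract describes, would need to attenuate or rescale exactly those components.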