Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

Large Language Models (LLMs) struggle with long-context code because of context-window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results show that visual code compression is a viable alternative for tasks requiring global understanding. At comparable compression ratios ($\sim$1.7$\times$), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at an about 4$\times$ higher compression ratio. Moreover, LongCodeOCR substantially reduces compression-stage overhead relative to LongCodeZip, cutting latency from $\sim$4.3 hours to $\sim$1 minute at 1M tokens. Finally, our results characterize a fundamental coverage--fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.
