Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations that balance accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
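As a rough illustration of the core idea (not the authors' pipeline), the sketch below renders plain text onto fixed-size page images using a few rendering parameters (page size, font size, margin, line spacing). These parameters are hypothetical stand-ins for the visual rendering configurations that Glyph's LLM-driven genetic search tunes; each resulting page image would then be passed to a VLM in place of the raw text tokens.

```python
# Minimal sketch, assuming a Pillow-based renderer; the function name and
# parameters are illustrative, not the released Glyph implementation.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_pages(text, page_size=(1024, 1024), font_size=18,
                         margin=24, line_spacing=4):
    """Wrap `text` onto fixed-size page images and return them as a list."""
    # A real renderer would load a scalable font, e.g. ImageFont.truetype(path, font_size).
    font = ImageFont.load_default()
    # Rough heuristic for how many characters fit on one line at this font size.
    chars_per_line = max((page_size[0] - 2 * margin) // (font_size // 2), 1)
    lines = textwrap.wrap(text, width=chars_per_line)
    lines_per_page = (page_size[1] - 2 * margin) // (font_size + line_spacing)

    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        y = margin
        for line in lines[start:start + lines_per_page]:
            draw.text((margin, y), line, fill="black", font=font)
            y += font_size + line_spacing
        pages.append(page)
    return pages
```

The compression comes from the fact that a VLM encodes each page into far fewer visual tokens than the number of text tokens the page depicts, which is the source of the reported 3-4x reduction.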