Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images, reducing token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We render a long text input as a single image and provide it directly to the model, dramatically reducing the number of decoder tokens required and offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
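To make the compression intuition concrete, the back-of-the-envelope arithmetic can be sketched as follows. All numbers here are illustrative assumptions, not figures from the paper: we assume a vision encoder that maps each rendered page to a fixed budget of 576 image tokens (a common 24x24 patch grid, though this varies by model) and a dense rendering that fits roughly 1,200 text tokens of content per page.

```python
# Illustrative sketch of the token savings from feeding text as an image.
# The per-page constants below are assumptions for illustration only:
# many vision encoders emit a fixed number of image tokens per image
# (576 corresponds to a 24x24 patch grid), and a densely rendered page
# is assumed to hold ~1,200 text tokens' worth of content.

def image_token_cost(num_text_tokens,
                     text_tokens_per_page=1200,
                     image_tokens_per_page=576):
    """Estimate decoder tokens consumed when the text is rendered as images."""
    pages = -(-num_text_tokens // text_tokens_per_page)  # ceiling division
    return pages * image_tokens_per_page

def compression_ratio(num_text_tokens, **kwargs):
    """Image-token cost divided by plain text-token cost (lower is better)."""
    return image_token_cost(num_text_tokens, **kwargs) / num_text_tokens

# Under these assumed constants, a 12,000-token input renders to 10 pages,
# costing 5,760 image tokens: a ratio of 0.48, i.e. "nearly half".
print(image_token_cost(12000), compression_ratio(12000))
```

Whether the ratio lands below 1 depends entirely on how densely text is rendered relative to the encoder's fixed image-token budget, which is why font size and page layout matter in practice.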