Recent work has shown the possibility of building open-vocabulary large language models (LLMs) that operate directly on pixel representations, implemented as encoder-decoder models that reconstruct masked image patches of rendered text. However, these pixel-based LLMs are limited to autoencoding tasks and cannot generate new text as images. As such, they cannot be used for open-answer or generative language tasks. In this work, we overcome this limitation and introduce PIXAR, the first pixel-based autoregressive LLM that does not rely on a pre-defined vocabulary for either input or output text. Consisting only of a decoder, PIXAR can answer free-form generative tasks while keeping text representation learning performance on par with previous encoder-decoder models. Furthermore, we highlight the challenges of autoregressively generating non-blurred text as images and link them to the standard maximum likelihood objective. We propose a simple adversarial pretraining that significantly improves the readability and performance of PIXAR, making it comparable to GPT-2 on short text generation tasks. This paves the way for building open-vocabulary LLMs usable for free-form generative tasks, and questions the necessity of the usual symbolic input representation, text as tokens, for these challenging tasks.