Pixel-based language models process text rendered as images, which allows them to handle any script and makes them a promising approach to open-vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, which may prove sub-optimal for downstream tasks because of the redundancy in the input representations. In this paper, we investigate four approaches to rendering text in the PIXEL model (Rust et al., 2023), and find that simple character bigram rendering improves performance on sentence-level tasks without compromising performance on token-level or multilingual tasks. This new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M parameter model. Our analyses show that character bigram rendering leads to a consistently better model, but one with an anisotropic patch embedding space driven by a patch frequency bias, which highlights the connections between image patch-based and tokenization-based language models.
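To make the idea of character bigram rendering concrete, the following is a minimal sketch of the chunking step only; it does not render any images. It assumes that each two-character chunk would become one input patch, that chunking restarts at every whitespace-separated word, and that odd-length words are padded on the right. The function names `char_bigrams` and `patch_vocabulary` are hypothetical and the actual PIXEL renderer may differ in all of these details.

```python
from __future__ import annotations

from collections import Counter


def char_bigrams(text: str, pad: str = " ") -> list[str]:
    """Split each whitespace-separated word into two-character chunks.

    Assumption: chunking restarts at every word boundary and odd-length
    words are right-padded, so a word maps to the same chunks wherever
    it appears. This is an illustration, not the paper's renderer.
    """
    chunks: list[str] = []
    for word in text.split():
        if len(word) % 2 == 1:
            word += pad  # pad so the last chunk also has two characters
        chunks.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return chunks


def patch_vocabulary(corpus: list[str]) -> Counter:
    """Count chunk occurrences; one chunk stands in for one rendered patch."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(char_bigrams(sentence))
    return counts


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran"]
    print(char_bigrams(corpus[0]))
    print(patch_vocabulary(corpus).most_common(3))
```

Under these assumptions the set of distinct chunks is bounded by the number of character pairs in the data, and a few chunks occur far more often than the rest, which is the kind of frequency skew the abstract connects to tokenization-based models.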