In language processing, transformers benefit greatly from condensed text. This is achieved with a larger vocabulary that captures word fragments instead of plain characters, most commonly via Byte Pair Encoding. For images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves the tokenisation of visual data by extending Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this by counting constellations of token pairs and replacing the most frequent pair with a newly introduced token. The multidimensionality increases the computation time only by a factor of 2 for images, making the method applicable even to large datasets such as ImageNet within minutes on consumer hardware; it is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data, achieved by compressing frequent constellations of tokens: the resulting sequences are shorter and carry more uniformly distributed information content, e.g. empty regions of an image are condensed into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy that amplifies this compression further by clustering the vocabulary.
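The core idea above — counting adjacent token pairs along each grid axis and merging the most frequent pair into a new token — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the grid representation, the `MERGED` placeholder, and the function names (`count_pairs`, `merge_most_frequent`) are assumptions introduced here, and the paper's exact constellation handling and merge bookkeeping may differ.

```python
from collections import Counter

MERGED = -1  # hypothetical placeholder for a cell absorbed into its neighbour


def count_pairs(grid):
    """Count horizontally and vertically adjacent token pairs in a 2D grid.

    Keys are (direction, first_token, second_token); cells already merged
    away are skipped, so only currently visible tokens are paired.
    """
    counts = Counter()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            a = grid[r][c]
            if a == MERGED:
                continue
            if c + 1 < cols and grid[r][c + 1] != MERGED:
                counts[("h", a, grid[r][c + 1])] += 1
            if r + 1 < rows and grid[r + 1][c] != MERGED:
                counts[("v", a, grid[r + 1][c])] += 1
    return counts


def merge_most_frequent(grid, next_token):
    """One 2D-BPE step: replace non-overlapping occurrences of the most
    frequent adjacent pair with a single new token, in place.

    Returns the merged (direction, a, b) pair, or None if nothing to merge.
    """
    counts = count_pairs(grid)
    if not counts:
        return None
    (direction, a, b), _ = counts.most_common(1)[0]
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != a:
                continue
            if direction == "h" and c + 1 < cols and grid[r][c + 1] == b:
                grid[r][c], grid[r][c + 1] = next_token, MERGED
            elif direction == "v" and r + 1 < rows and grid[r + 1][c] == b:
                grid[r][c], grid[r + 1][c] = next_token, MERGED
    return (direction, a, b)


# Toy usage: a uniform 2x4 region (e.g. an "empty" image patch) collapses
# from 8 tokens to 4 after a single merge of the dominant horizontal pair.
grid = [[0, 0, 0, 0],
        [0, 0, 0, 0]]
merged_pair = merge_most_frequent(grid, next_token=1)
```

Iterating this step while assigning fresh token ids grows the vocabulary with multi-cell constellations, mirroring how 1D BPE grows word fragments; the lossless property holds because every merge is invertible via the recorded pair table.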