Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence, and even with large language models, which allocate varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
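As a rough illustration of the recurrent rollout described in the abstract, the following Python sketch grows a set of 1D latent tokens from 32 to 256 over successive iterations. Only the 32 to 256 range comes from the abstract. The step of 32 tokens per iteration, the names `token_schedule`, `recurrent_rollout`, and `smallest_sufficient`, the `update` and `new_latents` callables, the order of operations within an iteration, and the threshold-based selection rule are all assumptions made for illustration, not details of the actual model.

```python
from typing import Callable, Dict, List, Tuple

Token = List[float]


def token_schedule(start: int = 32, step: int = 32, max_tokens: int = 256) -> List[int]:
    """Latent token count after each rollout iteration.

    The 32..256 range is from the abstract; the step size is an assumption.
    """
    return list(range(start, max_tokens + 1, step))


def recurrent_rollout(
    image_tokens: List[Token],
    update: Callable[[List[Token], List[Token]], Tuple[List[Token], List[Token]]],
    new_latents: Callable[[int], List[Token]],
    schedule: List[int],
) -> Dict[int, List[Token]]:
    """Run one iteration per schedule entry, growing the latent set each time.

    `update` is a hypothetical stand-in for the encoder step that refines the
    2D tokens and updates the existing 1D latents; `new_latents(n)` is a
    hypothetical factory returning n freshly initialized latent tokens.
    """
    latents: List[Token] = []
    snapshots: Dict[int, List[Token]] = {}
    for count in schedule:
        # Increase representational capacity by appending new latent tokens.
        latents = latents + new_latents(count - len(latents))
        # Refine the 2D tokens and update all 1D latents (assumed joint step).
        image_tokens, latents = update(image_tokens, latents)
        # Keep a copy so each token budget can be decoded independently.
        snapshots[count] = [list(t) for t in latents]
    return snapshots


def smallest_sufficient(losses: Dict[int, float], threshold: float) -> int:
    """Pick the smallest token count whose reconstruction loss meets the threshold.

    Falls back to the largest available count if none does. This selection
    rule is an assumption, shown as one way to obtain a variable-length code.
    """
    for count in sorted(losses):
        if losses[count] <= threshold:
            return count
    return max(losses)
```

With identity stand-ins for `update` and zero-initialized tokens for `new_latents`, `recurrent_rollout` returns one snapshot per token budget in `token_schedule()`, and `smallest_sufficient` can then choose a budget per image from the reconstruction losses measured at each snapshot.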