We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to each image's structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields visual generation quality that is competitive with or superior to other probabilistic model families, while also exhibiting favorable scaling behavior that prior vanilla AR approaches to visual generation have struggled to achieve.
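To make the two keep-probability regularizers concrete, the sketch below shows one plausible form of a monotonicity penalty (keep probabilities non-increasing along the 1D sequence) and a complexity-alignment term (expected kept length matched to an image-level complexity score). The function name, tensor shapes, and exact loss forms are illustrative assumptions, not the paper's specified objective.

```python
# Minimal sketch (hypothetical names and loss forms) of the keep-probability
# regularizers described above, assuming a PyTorch tokenizer encoder that
# emits per-token keep logits and an image-level complexity score in [0, 1].
import torch
import torch.nn.functional as F

def keep_prob_losses(keep_logits: torch.Tensor, complexity: torch.Tensor):
    """keep_logits: (B, T) per-token logits from the tokenizer encoder.
    complexity:  (B,) image-level complexity measure, scaled to [0, 1]."""
    keep_probs = torch.sigmoid(keep_logits)              # (B, T) in (0, 1)

    # Monotonicity: penalize any increase p_{t+1} > p_t along the sequence,
    # encouraging a soft "tail drop" where later tokens are less likely kept.
    increases = (keep_probs[:, 1:] - keep_probs[:, :-1]).clamp(min=0)
    mono_loss = increases.pow(2).mean()

    # Complexity alignment: the expected keep ratio (mean keep probability)
    # should track the image-level complexity score.
    expected_keep_ratio = keep_probs.mean(dim=1)          # (B,)
    align_loss = F.mse_loss(expected_keep_ratio, complexity)

    return mono_loss, align_loss

# Usage example with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, T = 4, 256
    logits = torch.randn(B, T, requires_grad=True)
    comp = torch.rand(B)                                  # e.g. edge density per image
    mono, align = keep_prob_losses(logits, comp)
    total = mono + align                                  # added to the autoencoder objective
    total.backward()
    print(f"monotonicity={mono.item():.4f} alignment={align.item():.4f}")
```

In practice these terms would be weighted and added to the standard autoencoder reconstruction and quantization losses; the weights and the choice of complexity measure are left open here.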