Despite progress in image tokenization, standard methods mix information of all granularities within each token, so redundant information persists across tokens. This mixing of granularities also complicates generator training. This paper introduces SelfBootTok, a method that addresses both problems by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual detail from the generator to the tokenizer. Consequently, our generator is far more efficient: it requires only the global tokens, reducing computation by approximately 40% while delivering superior reconstruction and generation quality. Moreover, the paradigm scales well: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.
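To make the data flow in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that local tokens are produced by a fixed linear map over the global tokens. The abstract does not describe the actual predictor, so the names `predict_local` and `decode_inputs` and the sizes `NUM_GLOBAL`, `NUM_LOCAL`, and `DIM` are all hypothetical and chosen only for illustration.

```python
import random


def predict_local(global_tokens, weights):
    """Predict each local token as a weighted sum of the global tokens only.

    A linear map is an illustrative stand-in; the real predictor is not
    specified in the abstract.
    """
    dim = len(global_tokens[0])
    return [
        [sum(w * g[d] for w, g in zip(row, global_tokens)) for d in range(dim)]
        for row in weights
    ]


def decode_inputs(global_tokens, weights):
    """Tokens handed to the decoder: global tokens plus predicted local ones."""
    return global_tokens + predict_local(global_tokens, weights)


random.seed(0)
NUM_GLOBAL, NUM_LOCAL, DIM = 4, 6, 3  # hypothetical sizes, not from the paper

# Random stand-ins for encoder outputs and for learned predictor weights.
global_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_GLOBAL)]
weights = [[random.gauss(0, 1) for _ in range(NUM_GLOBAL)] for _ in range(NUM_LOCAL)]

tokens_for_decoder = decode_inputs(global_tokens, weights)

# The generator only has to model the global tokens; local tokens are
# recomputed from them on the tokenizer side.
generator_sequence_length = len(global_tokens)
```

In this toy setup the decoder receives all ten tokens while the generator models a sequence of four. How much computation that saves in practice depends on the real split and on the generator architecture, neither of which is given in the abstract.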