We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding onto a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient, requiring no explicit codebook; (2) scalable to arbitrary token dimensions; and (3) compact, compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ the throughput of the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves video compression results comparable to state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve image synthesis quality competitive with GAN- and diffusion-based methods.
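The quantization step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a single embedding vector, maps each coordinate's sign to $\pm 1/\sqrt{d}$ so the quantized vector stays on the unit sphere, and reads the sign pattern as a $d$-bit token index, which is why no explicit codebook is stored. The function name `bsq_quantize` is our own.

```python
import numpy as np

def bsq_quantize(z):
    """Sketch of Binary Spherical Quantization (BSQ).

    Projects an embedding onto the unit hypersphere, then binarizes each
    coordinate to +/- 1/sqrt(d) so the quantized vector also lies on the
    unit sphere. The d sign bits form the token index directly, so the
    codebook of size 2^d is implicit rather than stored.
    """
    d = z.shape[-1]
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)  # project to unit sphere
    q = np.where(u >= 0, 1.0, -1.0) / np.sqrt(d)       # binary quantization
    bits = (u >= 0).astype(np.uint8)                    # d sign bits
    token = int(bits.dot(1 << np.arange(d)))            # implicit codebook index
    return q, token

# Example: a 4-dimensional embedding yields a 4-bit token (codebook size 16).
z = np.array([0.7, -1.3, 2.1, -0.2])
q, token = bsq_quantize(z)
```

Because the token index is just the concatenated sign bits, the effective codebook size grows as $2^d$ with the token dimension $d$ while adding no parameters, which is the scalability property the abstract refers to.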