We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
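To make the pre-training objective described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: frames are assumed to have been mapped to discrete visual tokens, and a causal transformer is trained with a next-token cross-entropy loss. The codebook size, model dimensions, sequence length, and the random integer batch standing in for tokenized video are all illustrative assumptions.

```python
# Minimal sketch of autoregressive next-visual-token pre-training.
# All sizes below are illustrative placeholders, not Toto's configuration.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # assumed size of the visual-token codebook
D_MODEL    = 512    # illustrative model width
N_LAYERS   = 8
N_HEADS    = 8
SEQ_LEN    = 256    # visual tokens per training sequence

class CausalVideoTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, tokens):  # tokens: (B, T) int64 visual-token ids
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T, VOCAB_SIZE) next-token logits

model = CausalVideoTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One training step; the random batch stands in for tokenized video frames.
tokens = torch.randint(0, VOCAB_SIZE, (4, SEQ_LEN))
logits = model(tokens[:, :-1])  # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

The sketch covers only a single pre-training step; generation at inference time would sample tokens one at a time from the same model.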