We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
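The training objective described above can be illustrated with a minimal sketch. It assumes each image has already been mapped to a list of integer token ids by some visual tokenizer, and that a model supplies one vector of logits per sequence position. The abstract specifies neither, so the names `build_visual_sentence` and `next_token_cross_entropy` are hypothetical and the code is not the authors' implementation.

```python
import math


def build_visual_sentence(token_groups):
    """Concatenate per-image token lists into one flat sequence.

    Assumption: each inner list holds the integer token ids of one image
    (or one annotation map), in the order they appear in the sentence.
    """
    return [tok for group in token_groups for tok in group]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_cross_entropy(logits_per_position, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits at position t.

    logits_per_position[t] is the (hypothetical) model output after reading
    tokens[0..t]; the logits at the last position have no target and are unused.
    """
    if len(logits_per_position) != len(tokens):
        raise ValueError("need one logit vector per token position")
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(tokens) - 1):
        log_probs = log_softmax(logits_per_position[t])
        total -= log_probs[tokens[t + 1]]
    return total / (len(tokens) - 1)
```

As a sanity check, a model that outputs identical logits for every entry of a vocabulary of size V incurs a loss of exactly log V, which is the baseline any trained model should improve upon.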