We present Xray-Visual, a unified vision architecture for large-scale image and video understanding, trained on industry-scale social media data. The model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, with robust data curation pipelines that apply balancing and noise-suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised masked autoencoding (MAE), semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize the image and video modalities. The architecture builds on a Vision Transformer backbone augmented with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval, and that the model is robust to domain shift and adversarial perturbations. We further show that integrating large language models as text encoders, following LLM2CLIP, substantially improves retrieval performance and generalization, particularly in real-world environments. Xray-Visual thus sets a new standard for scalable multimodal vision models while maintaining high accuracy and computational efficiency.
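The abstract does not spell out the training objective, but CLIP-style contrastive learning conventionally denotes a symmetric InfoNCE loss over in-batch image-text (or video-hashtag) pairs. The following is a minimal PyTorch sketch of that standard formulation, not the paper's implementation; the function name, temperature value, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Illustrative sketch of the standard CLIP objective, not the
    paper's exact implementation. image_emb and text_emb are
    (batch, dim) encoder outputs; matching pairs share a row index,
    and all other rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i sits in column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In the third training stage described above, the same objective would presumably be applied to video-hashtag pairs by substituting a pooled video embedding for the image embedding.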
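EViT expedites Vision Transformers by keeping the patch tokens that the [CLS] token attends to most and fusing the remaining, less attended tokens into a single token. The sketch below illustrates that reorganization step under the assumptions stated in the comments; the function name, keep ratio, and interface are hypothetical and not taken from the paper.

```python
import torch

def evit_token_reorganize(tokens: torch.Tensor,
                          cls_attn: torch.Tensor,
                          keep_ratio: float = 0.7) -> torch.Tensor:
    """EViT-style token reorganization between transformer blocks.

    Illustrative sketch, not the paper's implementation.
    tokens:   (batch, 1 + n, dim) -- [CLS] followed by n patch tokens.
    cls_attn: (batch, n) -- attention of [CLS] to each patch token,
              averaged over heads in the preceding attention layer.
    Keeps the top-k most attended patch tokens and fuses the rest
    into one token, weighted by their attention scores.
    """
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    n = patches.size(1)
    k = max(1, int(n * keep_ratio))

    # Indices of the k most attended ("attentive") patch tokens.
    topk = cls_attn.topk(k, dim=1).indices                      # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    kept = patches.gather(1, idx)                               # (batch, k, dim)

    # Fuse the remaining tokens into one, weighted by attention.
    mask = torch.ones_like(cls_attn, dtype=torch.bool)
    mask.scatter_(1, topk, False)
    rest = patches[mask].view(patches.size(0), n - k, -1)
    w = cls_attn[mask].view(cls_attn.size(0), n - k, 1)
    fused = (w * rest).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)

    return torch.cat([cls_tok, kept, fused], dim=1)  # (batch, 1 + k + 1, dim)
```

Applied after several blocks of a ViT, this reduces the token count (and hence the quadratic attention cost) in all subsequent layers while retaining a summary of the discarded tokens.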