Recognize Anything: A Strong Image Tagging Model

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM takes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained with the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised methods and exhibits performance competitive with the Google tagging API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.

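The first of the four steps, obtaining tags from captions without manual annotation, can be illustrated with a small sketch. The abstract only states that tags come from "automatic text semantic parsing" and does not describe the parser, so the code below substitutes a much simpler technique: matching caption words against a fixed tag vocabulary. The vocabulary, the synonym map, and the function names are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical tag vocabulary; the real label system is not given in the abstract.
TAG_VOCAB = {"dog", "cat", "frisbee", "grass", "beach", "person"}

# Hypothetical synonym map that folds surface forms into canonical tags.
SYNONYMS = {"puppy": "dog", "kitten": "cat", "man": "person", "woman": "person"}


def normalize(token: str) -> str:
    """Lowercase a token, map synonyms, and strip a simple plural 's'."""
    token = token.lower()
    token = SYNONYMS.get(token, token)
    if token not in TAG_VOCAB and token.endswith("s") and token[:-1] in TAG_VOCAB:
        token = token[:-1]
    return token


def parse_tags(caption: str) -> list[str]:
    """Return the sorted set of vocabulary tags mentioned in a caption."""
    tokens = re.findall(r"[A-Za-z]+", caption)
    return sorted({t for t in map(normalize, tokens) if t in TAG_VOCAB})


def tag_frequencies(captions: list[str]) -> Counter:
    """Count how many captions mention each tag (at most once per caption)."""
    counts = Counter()
    for caption in captions:
        counts.update(parse_tags(caption))
    return counts


if __name__ == "__main__":
    captions = [
        "A puppy catches a frisbee on the grass",
        "Two dogs and a woman at the beach",
    ]
    for caption in captions:
        print(caption, "->", parse_tags(caption))
    print(tag_frequencies(captions))
```

In a pipeline shaped like the one the abstract describes, each image-text pair would then carry two supervision signals: the original caption for the captioning task and the parsed tag list for the tagging task. Tag frequencies of this kind are one plausible input for deciding which categories count as "common", though the abstract does not say how that choice is made.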
