Recognize Anything: A Strong Image Tagging Model

We present the Recognize Anything Model (RAM), a strong foundation model for image tagging. RAM can recognize any common category with high accuracy. It introduces a new paradigm for image tagging: training on large-scale image-text pairs instead of manual annotations. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained on the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised methods and performs competitively with the Google API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
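
The first step above, deriving image tags from caption text without manual labeling, can be illustrated with a toy sketch. This is not the semantic parser used for RAM: the vocabulary `TAG_VOCAB`, its synonym sets, and the helper names `parse_tags` and `tag_frequencies` are all hypothetical, and plain whole-word matching stands in for real semantic parsing.

```python
import re
from collections import Counter

# Hypothetical tag vocabulary with synonym sets (an assumption for this sketch,
# not the label system used by RAM).
TAG_VOCAB = {
    "dog": {"dog", "dogs", "puppy"},
    "grass": {"grass", "lawn"},
    "frisbee": {"frisbee"},
    "person": {"person", "man", "woman", "people"},
}


def parse_tags(caption):
    """Return the sorted tags whose synonyms occur as whole words in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return sorted(tag for tag, synonyms in TAG_VOCAB.items() if words & synonyms)


def tag_frequencies(captions):
    """Count how many captions yield each tag."""
    counts = Counter()
    for caption in captions:
        counts.update(parse_tags(caption))
    return counts


if __name__ == "__main__":
    captions = [
        "A man throws a frisbee to his dog on the lawn.",
        "Two dogs sleeping.",
    ]
    for caption in captions:
        print(caption, "->", parse_tags(caption))
    print(tag_frequencies(captions))
```

Tags produced this way are only as reliable as the captions and the matching rule, which is one reason the pipeline described above includes a later step that adds missing annotations and cleans incorrect ones.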

