Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. They are, however, mostly evaluated in English because multilingual benchmarks are scarce. We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of ImageNet labels into 100 languages, built without machine translation or manual annotation. Instead, we automatically obtain reliable translations by linking the labels -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 11 public multilingual CLIP models on zero-shot image classification (ZS-IC) on our benchmark and find a significant gap between English ImageNet performance and performance on high-resource languages (e.g., German or Chinese), and an even larger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance correlates highly with their image-text retrieval performance, which validates Babel-ImageNet as a tool for evaluating multilingual models on the vast majority of languages that lack gold image-text data. Finally, we show that the performance of multilingual CLIP on low-resource languages can be drastically improved with parameter-efficient language-specific training. We make our code and data publicly available: \url{https://github.com/gregor-ge/Babel-ImageNet}
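To make the evaluation setup concrete, the following is a minimal sketch of how synset-linked labels and zero-shot classification over a partial label set might fit together. It is not the released implementation: the toy synset table, the `SYNSET_TO_LABELS` dictionary standing in for a BabelNet lookup, the hand-written embeddings standing in for CLIP encoder outputs, and all function names are assumptions made for illustration. It also assumes that images whose class has no label in the target language are excluded from scoring.

```python
import math

# Toy data (assumption): class index -> WordNet synset ID.
CLASS_TO_SYNSET = {0: "n02084071", 1: "n02121808", 2: "n01503061"}

# Hypothetical stand-in for a BabelNet lookup: synset ID -> {language: label}.
# Coverage is partial: the second synset has no Sinhala ("si") entry.
SYNSET_TO_LABELS = {
    "n02084071": {"de": "Hund", "si": "බල්ලා"},
    "n02121808": {"de": "Katze"},
    "n01503061": {"de": "Vogel", "si": "කුරුල්ලා"},
}


def labels_for_language(lang):
    """Return {class index: label} for the classes that have a label in `lang`."""
    labels = {}
    for cls, synset in CLASS_TO_SYNSET.items():
        label = SYNSET_TO_LABELS.get(synset, {}).get(lang)
        if label is not None:
            labels[cls] = label
    return labels


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_accuracy(image_embs, gold_classes, label_embs):
    """Top-1 accuracy over images whose gold class appears in `label_embs`."""
    correct = total = 0
    for emb, gold in zip(image_embs, gold_classes):
        if gold not in label_embs:
            continue  # no label for this class in the language: skip the image
        # Predict the class whose label embedding is most similar to the image.
        pred = max(label_embs, key=lambda c: cosine(emb, label_embs[c]))
        correct += pred == gold
        total += 1
    return correct / total if total else 0.0


si_labels = labels_for_language("si")  # only classes 0 and 2 are covered

# Stand-ins for text-encoder outputs of the Sinhala labels, keyed by class.
label_embs = {0: [1.0, 0.0], 2: [0.0, 1.0]}

# Stand-ins for image-encoder outputs and their gold classes.
image_embs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.7, 0.3]]
gold = [0, 1, 2, 2]

acc = zero_shot_accuracy(image_embs, gold, label_embs)
print(sorted(si_labels), acc)
```

In this toy run the image with gold class 1 is skipped because that class has no Sinhala label, so accuracy is computed over the remaining three images. With a real model, `label_embs` and `image_embs` would come from the text and image encoders respectively.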