Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. The bulk of the evaluation of these models is, however, performed with English text only: the costly creation of language-specific image-caption datasets has limited multilingual VL benchmarks to a handful of high-resource languages. In this work, we introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of 1000 ImageNet labels into 92 languages, built without resorting to machine translation (MT) or requiring manual annotation. We instead automatically obtain reliable translations of ImageNet concepts by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 8 different publicly available multilingual CLIP models on zero-shot image classification (ZS-IC) for each of the 92 Babel-ImageNet languages, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance on Babel-ImageNet correlates highly with their performance in image-text retrieval, validating that Babel-ImageNet is suitable for estimating the quality of the multilingual VL representation spaces for the vast majority of languages that lack gold image-text data. Finally, we show that the performance of multilingual CLIP for low-resource languages can be drastically improved via cheap, parameter-efficient language-specific training. We make our code and data publicly available: \url{https://github.com/gregor-ge/Babel-ImageNet}.
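To make the ZS-IC setup concrete, the following is a minimal sketch of zero-shot classification with translated class labels: each image is assigned the class whose label embedding has the highest cosine similarity to the image embedding. It is not the released Babel-ImageNet code. The function names (`cosine`, `zero_shot_classify`, `top1_accuracy`) are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the outputs of a multilingual CLIP text encoder and image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, label_embs):
    """Return the class label whose embedding is most similar to the image."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))


def top1_accuracy(image_embs, gold, label_embs):
    """Fraction of images whose predicted label matches the gold label."""
    hits = sum(
        zero_shot_classify(emb, label_embs) == label
        for emb, label in zip(image_embs, gold)
    )
    return hits / len(gold)


# Assumption: toy vectors in place of real encoder outputs.
# Keys are German labels for three classes; only classes that have a
# translation in the target language would appear here.
label_embs_de = {
    "Hund": [1.0, 0.1, 0.0],
    "Katze": [0.0, 1.0, 0.1],
    "Auto": [0.1, 0.0, 1.0],
}
image_embs = [
    [0.9, 0.2, 0.1],  # clearly closest to "Hund"
    [0.1, 0.8, 0.3],  # clearly closest to "Katze"
    [0.6, 0.7, 0.0],  # ambiguous image, slightly closer to "Katze"
]
gold = ["Hund", "Katze", "Hund"]

accuracy = top1_accuracy(image_embs, gold, label_embs_de)
print(f"top-1 accuracy: {accuracy:.3f}")
```

Because the translations are partial, a per-language evaluation along these lines would only range over the classes that have a label in that language, so the number of candidate classes can differ from one language to the next.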