This work introduces CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. While CLIP has excelled in zero-shot vision-language tasks, the resource-intensive nature of model training remains a challenge. Many datasets lack linguistic diversity, providing only English descriptions of images. CAPIVARA addresses this by augmenting text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. We optimize the training pipeline with LiT, LoRA, and gradient checkpointing to reduce the computational cost. Through extensive experiments, CAPIVARA achieves state-of-the-art results in zero-shot tasks involving images and Portuguese texts. We also show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a single GPU for 2 hours. Our model and code are available at https://github.com/hiaac-nlp/CAPIVARA.
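To illustrate why LoRA, one of the techniques named in the abstract, reduces training cost, the following is a minimal pure-Python sketch of a low-rank adapted linear layer. It is not the CAPIVARA implementation: the class name `LoRALinear`, the helper `matmul`, the initialization scale, and the dimensions are all hypothetical and chosen only for illustration. The sketch assumes the standard LoRA formulation, in which a frozen weight W is complemented by a trainable update (alpha / r) * B A, with B initialized to zero.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical layer: frozen W (d_out x d_in) plus update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero,
        # so the low-rank update is zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """x is a batch of row vectors (n x d_in); returns n x d_out."""
        w_t = [list(col) for col in zip(*self.weight)]  # d_in x d_out
        a_t = [list(col) for col in zip(*self.A)]       # d_in x r
        b_t = [list(col) for col in zip(*self.B)]       # r x d_out
        base = matmul(x, w_t)
        update = matmul(matmul(x, a_t), b_t)
        return [[p + self.scale * q for p, q in zip(br, ur)]
                for br, ur in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B would receive gradients; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Illustrative sizes only (assumption, not taken from the paper).
d_in, d_out, rank = 64, 64, 4
W = [[0.01 * (i + j) for j in range(d_in)] for i in range(d_out)]
layer = LoRALinear(W, rank=rank, alpha=8.0)
print(layer.trainable_parameters(), "trainable vs", d_in * d_out, "frozen weights")
```

With these illustrative sizes, only the two small matrices A and B would be trained while the full weight matrix stays fixed, which is the source of the memory and compute savings. In practice this idea would be applied to the layers of a real CLIP encoder through a deep-learning framework rather than plain Python lists.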