This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.
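The province-level protocol described in the abstract (24 training, 6 validation, and 8 unseen test provinces, with a retrieval bank drawn only from training provinces) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_provinces` and `build_retrieval_bank`, the record fields `province` and `caption`, the placeholder province names, the random seed, and the assumption of 100 images per province are all hypothetical and chosen only for illustration.

```python
import random


def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition province names into disjoint train/val/test sets.

    The 24/6/8 sizes follow the abstract; the random assignment and the
    seed are assumptions, since the abstract does not say how provinces
    were assigned to each group.
    """
    if len(set(provinces)) != len(provinces):
        raise ValueError("province names must be unique")
    if n_train + n_val + n_test != len(provinces):
        raise ValueError("split sizes must sum to the number of provinces")
    shuffled = list(provinces)
    random.Random(seed).shuffle(shuffled)
    train = set(shuffled[:n_train])
    val = set(shuffled[n_train:n_train + n_val])
    test = set(shuffled[n_train + n_val:])
    return train, val, test


def build_retrieval_bank(records, train_provinces):
    """Keep only captions whose province belongs to the training set."""
    return [r["caption"] for r in records if r["province"] in train_provinces]


# Hypothetical placeholder data: 38 provinces, 100 records each (assumed uniform).
provinces = [f"province_{i:02d}" for i in range(38)]
records = [
    {"province": p, "caption": f"caption {j} for {p}"}
    for p in provinces
    for j in range(100)
]

train, val, test = split_provinces(provinces)
bank = build_retrieval_bank(records, train)

# Leakage check: no validation or test province contributes to the bank.
held_out_captions = {
    r["caption"] for r in records if r["province"] in (val | test)
}
leaked = held_out_captions.intersection(bank)
```

Splitting by province rather than by image is what makes the setting inductive: every image from a test province is held out together, so neither the decoder nor the retrieval bank sees that province's vocabulary before evaluation.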