Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
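As a rough illustration of the query structure described above, the sketch below models one composed query (a sketch plus three levels of text) and scores retrieval with Recall@K. It is a minimal sketch under stated assumptions, not the dataset's actual schema or evaluation protocol: the class `ComposedQuery`, its field names, and the function `recall_at_k` are hypothetical, and Recall@K is assumed here only because it is a common CIR metric; the abstract does not say which metrics the benchmark reports.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ComposedQuery:
    """One sketch+text query. All field names are hypothetical."""

    sketch_path: str             # path to the human-drawn sketch
    descriptions: Sequence[str]  # one text per semantic level (three levels)
    target_id: str               # ID of the ground-truth Thangka image

    def __post_init__(self) -> None:
        # The abstract states three semantic levels per image.
        if len(self.descriptions) != 3:
            raise ValueError("expected exactly three descriptions, one per level")


def recall_at_k(
    rankings: Dict[str, List[str]],
    targets: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose target image ID is among the top-k results.

    rankings maps a query ID to image IDs sorted from best to worst match;
    targets maps the same query IDs to the ground-truth image ID.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not rankings:
        raise ValueError("rankings must contain at least one query")
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return hits / len(rankings)
```

A retrieval model would produce the `rankings` mapping by scoring every gallery image against each composed query; the metric itself is independent of how the sketch and text are fused.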