Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.