Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.