Bridging the Modality Gap in Forensic Image Retrieval

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system uses a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, which are then embedded with a sentence transformer for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings, and a multimodal fusion strategy that combines text-based similarity scores with image-based similarity scores obtained from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
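The score-level fusion described above can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the min-max normalization, the weighted sum, and the weight `alpha` below are assumptions made for illustration, and the function names (`cosine`, `min_max`, `fuse_and_rank`) are hypothetical. The toy vectors stand in for embeddings that would, in practice, come from a visual feature extractor and from a sentence transformer applied to the MLLM-generated descriptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max(scores):
    """Rescale scores to [0, 1] so both modalities are on a comparable scale."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(q_img, q_txt, gallery_img, gallery_txt, alpha=0.5):
    """Rank gallery items by a weighted sum of text and image similarity.

    alpha is an assumed fusion weight: 1.0 means text only, 0.0 means image only.
    Returns (gallery indices sorted best-first, fused score per gallery item).
    """
    img_scores = min_max([cosine(q_img, g) for g in gallery_img])
    txt_scores = min_max([cosine(q_txt, g) for g in gallery_txt])
    fused = [alpha * t + (1.0 - alpha) * i for i, t in zip(img_scores, txt_scores)]
    ranking = sorted(range(len(fused)), key=lambda k: fused[k], reverse=True)
    return ranking, fused


# Toy data (made up): item 2 matches the query description best,
# while item 0 merely looks most similar to a noisy query image.
query_img, query_txt = [1.0, 0.0], [1.0, 0.0]
gallery_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_txt = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

ranking, fused = fuse_and_rank(query_img, query_txt, gallery_img, gallery_txt)
print(ranking)
```

With these toy vectors, image-only ranking (`alpha=0.0`) places item 0 first, whereas the equally weighted fusion moves item 2 to the top because its description matches the query text. This mirrors, in a simplified way, the situation in which text evidence compensates for a limited or noisy visual query.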
