In this paper, we present a pipeline for extracting images from historical documents using foundation models, and we evaluate text-image prompts and their effectiveness on humanities datasets of varying complexity. The approach is motivated, on the one hand, by historians' strong interest in the visual elements printed alongside historical texts and, on the other, by the relative lack of well-annotated datasets in the humanities compared to other domains. We propose a sequential approach that relies on Grounding DINO and Meta's Segment Anything Model (SAM) to retrieve a significant portion of the visual data in historical documents, which can then be used for downstream development tasks and dataset creation. We also evaluate the effect of different linguistic prompts on the resulting detections.