Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives, and artistic expression. Yet many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization, and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared with traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, with implications for heritage preservation.