Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives, and artistic expression. Yet many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization, and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, with implications for heritage preservation.
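To make the pipeline concrete, below is a minimal Python sketch of what a TMR schema and the RAG lookup step could look like. Everything here is an illustrative assumption rather than the paper's actual implementation: the TMR field names, the toy gazetteer and occupation table, the example codes, and the `parse_tombstone` stub that stands in for the real VLM call.

```python
# Hypothetical sketch of a Tombstone Meaning Representation (TMR) and a
# RAG-style retrieval step. Field names, lookup tables, and codes are
# illustrative assumptions, not the paper's actual schema or API.
from dataclasses import dataclass, field


@dataclass
class TMR:
    """Structured representation parsed from a tombstone image."""
    name: str
    birth_year: int | None = None
    death_year: int | None = None
    birthplace: str | None = None        # normalized toponym
    occupation_code: str | None = None   # e.g., a HISCO-style code
    epitaph: str = ""
    iconography: list[str] = field(default_factory=list)


# Tiny in-memory stand-ins for the external resources (a gazetteer and an
# occupation ontology) that the RAG step would query; entries are invented.
GAZETTEER = {"groningen": "Groningen, NL", "emden": "Emden, DE"}
OCCUPATIONS = {"blacksmith": "83110", "teacher": "13200"}  # illustrative codes


def retrieve_context(raw_text: str) -> dict[str, str]:
    """Naive retrieval: surface-match inscription tokens against resources."""
    hits: dict[str, str] = {}
    for token in raw_text.lower().split():
        if token in GAZETTEER:
            hits[token] = GAZETTEER[token]
        if token in OCCUPATIONS:
            hits[token] = OCCUPATIONS[token]
    return hits


def parse_tombstone(ocr_text: str) -> TMR:
    """Stand-in for the VLM call: prompt = inscription + retrieved context."""
    context = retrieve_context(ocr_text)
    # A real pipeline would send the image plus `context` to a VLM and
    # decode its structured output into a TMR; here we return a stub.
    return TMR(name="<parsed by VLM>",
               epitaph=ocr_text,
               birthplace=context.get("groningen"),
               occupation_code=context.get("blacksmith"))


print(parse_tombstone("Here rests a blacksmith of Groningen"))
```

In a full system, the retrieved gazetteer entries and ontology codes would presumably be injected into the VLM prompt so the model can ground inscription text in external resources, which is the role RAG plays in the framework described above.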