This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark for historical legislative document analysis systems that addresses the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a large repository of documents dating back to the 17th century and serves both as a training resource and as an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The benchmark tackles several facets of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, while accommodating varying levels of document legibility. By providing baselines and data, it aims to spur the development and evaluation of robust historical document retrieval systems, particularly in scenarios that span a wide historical period.