Archives play a crucial role in preserving information and knowledge, and the exponential growth of archival data calls for efficient, automated tools for managing and using archival information resources. Archival applications involve massive volumes of data that are challenging to process and analyze. Although LLMs have made remarkable progress in diverse domains, no publicly available LLM is tailored to archives. To address this gap, we introduce ArcGPT, to our knowledge the first general-purpose LLM tailored to the archival field. To improve performance on real-world archival tasks, ArcGPT has been pre-trained on extensive archival domain data. Alongside ArcGPT, we release AMBLE, a benchmark comprising four real-world archival tasks. Evaluation on AMBLE shows that ArcGPT outperforms existing state-of-the-art models, marking a substantial step forward in effective archival data management. Ultimately, ArcGPT aims to better serve the archival community, aiding archivists in their crucial role of preserving and harnessing our collective information and knowledge.