This article presents an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) to assist in restoring missing or illegible characters in ancient Greek inscriptions and documentary papyri. Using a straightforward instruction-based approach and a 95%/5% train/test split, the papyrus restoration model achieved a character error rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 86.0% for sequences of up to 10 characters. A model fine-tuned for geographic attribution reached a top-1 accuracy of 66.4% and a top-3 accuracy of 79.9%. In chronological attribution, the model deviated from the actual terminus post/ante quem by 21.7 years on average, with a median deviation of 0 years. For inscriptions, the restoration model achieved a CER of 20.5%, a top-1 accuracy of 63.7%, and a top-20 accuracy of 83.0% for sequences of up to 10 characters. In geographic attribution it attained a top-1 accuracy of 75.0% and a top-3 accuracy of 83.7%, while in dating it deviated from the actual date range by 37.1 years on average, with a median deviation of 3 years. Benchmarked against the state-of-the-art model (Ithaca) on a shared test set and on recently edited inscriptions, the instruction-tuned models excelled in text restoration, while also offering the practical advantage of ignoring spaces during reconstruction, in keeping with the scriptio continua of ancient textual artifacts. Their performance in geographic and chronological attribution, however, fell short of Ithaca's. To allow a more direct comparison, the instruction-tuned model was retrained with an 80%/10%/10% train/validation/test split and still outperformed Ithaca in text restoration. The results suggest that fine-tuning larger pretrained causal language models with instruction templates holds promise for proposing emendations of and conjectures to ancient texts.
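The character error rate reported above is conventionally computed as the Levenshtein edit distance between prediction and ground truth, divided by the length of the ground truth. A minimal sketch of that metric follows; the Greek example strings are hypothetical illustrations, not drawn from the paper's data, and the space-stripping step merely mirrors the stated convention of ignoring word division during reconstruction.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate, ignoring spaces (as in scriptio continua)."""
    p = prediction.replace(" ", "")
    r = reference.replace(" ", "")
    return levenshtein(p, r) / len(r)

# Hypothetical restoration vs. ground truth: 2 edits over 6 reference
# characters once spaces are removed, i.e. a CER of 2/6.
print(cer("και τοις", "και του"))
```

A top-k accuracy, by contrast, simply asks whether any of the model's k highest-ranked candidate restorations matches the ground truth exactly.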