Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large coin collections and to help collectors understand what they are buying or selling. Recent research in this area has shown promise by using convolutional neural networks (CNNs) to identify the semantic elements commonly depicted on ancient coins. This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). It summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis, and evaluates their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.