Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor-intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription; this poses a bottleneck for researchers seeking to apply NLP toolkits to the study of ancient logographic languages, since most of the relevant data are images of writing. This paper investigates whether directly processing visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems, along with annotations for tasks such as classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations on some of the investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data in logographic languages for NLP-based analysis.