Language Models (LMs) such as BERT have been shown to perform well on the task of identifying Named Entities (NEs) in text. A BERT LM is typically used as a classifier that assigns individual tokens, or spans of tokens, in the input text to one of a set of possible NE categories. In this paper, we hypothesise that decoder-only Large Language Models (LLMs) can also be used generatively, both to extract each NE and potentially to recover its correct surface form, so that any spelling errors present in the input text are automatically corrected. We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on the task of producing NEs from text obtained by applying Optical Character Recognition (OCR) to images of Japanese shop receipts; in this work, we do not attempt to find or evaluate the location of NEs in the text. We show that the best fine-tuned LLM performs as well as, or slightly better than, the best fine-tuned BERT LM, although the differences are not significant. However, the best LLM is also shown to correct OCR errors in some cases, as initially hypothesised.