Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, the digitally archived journalism of Oscar Wilde's 19th century often has missing or poor-quality Optical Character Recognition (OCR), which reduces the accessibility of these archives and makes them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. Pixtral's OCR capability was compared with that of four other OCR approaches; it achieved a median character error rate of 1%, five times lower than that of the next-best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four types and seventeen topics. The dataset contains 1.4 million entries and 321 million words. Example use cases demonstrate analysis of topic similarity, readability, and event tracking. NCSE v2.0 is freely available to encourage historical and sociological research. As a result, 21st-century readers can now share Oscar Wilde's disappointment with 19th-century journalistic standards, reading the unreadable from the comfort of their own computers.
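For readers unfamiliar with the headline metric, the sketch below shows one common way to compute a character error rate (CER) and its median over a set of pages. It assumes the standard definition, character-level Levenshtein distance divided by the length of the reference transcription; the abstract does not specify the exact definition or any text normalisation used in the paper, and the function names here are illustrative only.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def median_cer(pairs) -> float:
    """Median CER over (reference, OCR output) pairs, e.g. one pair per page."""
    return median(character_error_rate(ref, hyp) for ref, hyp in pairs)
```

Under this definition, a median CER of 1% means that for the typical page roughly one character in a hundred must be edited to recover the reference transcription.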