Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using several recent versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, two state-of-the-art (SOTA) Generative Adversarial Network (GAN) methods, as well as our Conditional GAN (cGAN) baseline. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.



