A Novel Dataset for Non-Destructive Inspection of Handwritten Documents

Forensic handwriting examination is a branch of forensic science that examines handwritten documents in order to identify, or hypothesize, the manuscript's author. The analysis compares two or more (digitized) documents through a comprehensive examination of intrinsic local and global features. If a correlation exists and specific best practices are satisfied, it is possible to affirm that the documents under analysis were written by the same individual. The need for sophisticated tools capable of extracting and comparing significant features has led to the development of cutting-edge software with almost entirely automated processes, improving the forensic examination of handwriting and achieving increasingly objective evaluations. This is made possible by algorithmic solutions based on purely mathematical concepts. Machine Learning and Deep Learning models trained on specific datasets could turn out to be key elements in solving the task at hand. In this paper, we propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either with the classic "pen and paper" approach (and later digitized) or acquired directly on common devices such as tablets; the second consists of 362 handwritten manuscripts by 124 different people, acquired following a specific pipeline. Our study pioneers a comparison between traditionally handwritten documents and those produced with digital tools (e.g., tablets). Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset (documents written with pen and paper and later digitized, and documents written on tablets) and 96% on the second subset. The datasets are available at https://iplab.dmi.unict.it/mfs/forensic-handwriting-analysis/novel-dataset-2023/.

