Towards Real-World Writing Assistance: A Chinese Character Checking Benchmark with Faked and Misspelled Characters

Writing assistance is an application closely tied to everyday life and a fundamental research area in Natural Language Processing (NLP). Its aim is to improve the correctness and quality of input texts, and character checking, which detects and corrects wrong characters, is a crucial part of it. In real-world settings, where handwriting predominates, the characters that humans get wrong include faked characters (i.e., non-existent characters created through writing errors) and misspelled characters (i.e., real characters used incorrectly due to spelling errors). However, existing datasets and related studies focus only on misspelled characters, which are mainly caused by phonological or visual confusion, and thereby ignore faked characters, which are more common and more difficult. To address this gap, we present Visual-C$^3$, a human-annotated Visual Chinese Character Checking dataset with faked and misspelled Chinese characters. To the best of our knowledge, Visual-C$^3$ is the first real-world visual dataset and the largest human-crafted dataset for the Chinese character checking scenario. We also propose and evaluate novel baseline methods on Visual-C$^3$. Extensive empirical results and analyses show that Visual-C$^3$ is high-quality yet challenging. The Visual-C$^3$ dataset and the baseline methods will be made publicly available to facilitate further research in the community.

