Writing assistance is an application closely tied to everyday life and a fundamental research area in Natural Language Processing (NLP). Its aim is to improve the correctness and quality of input text, and character checking, which detects and corrects wrong characters, is a crucial part of it. In the real world, where the vast majority of writing is done by hand, the characters people get wrong include faked characters (i.e., characters that do not actually exist, created by writing errors) and misspelled characters (i.e., real characters used incorrectly due to spelling errors). However, existing datasets and related studies focus only on misspelled characters, which are mainly caused by phonological or visual confusion, and thereby ignore faked characters, which are more common and more difficult to handle. To address this gap, we present Visual-C$^3$, a human-annotated Visual Chinese Character Checking dataset containing both faked and misspelled Chinese characters. To the best of our knowledge, Visual-C$^3$ is the first real-world visual dataset and the largest human-crafted dataset for the Chinese character checking scenario. We also propose and evaluate novel baseline methods on Visual-C$^3$. Extensive empirical results and analyses show that Visual-C$^3$ is high-quality yet challenging. The Visual-C$^3$ dataset and the baseline methods will be made publicly available to facilitate further research in the community.