Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. The causes of such errors are manifold, and errors can be introduced during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies: most are informal and do not cover the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types for tabular data, including both data errors (e.g., missing values, duplicate tuples) and error indicators (e.g., outliers, bias), classified into three non-overlapping categories: missing, incorrect, and redundant. For each error type, we provide a formal definition and a practical example, and we resolve terminological inconsistencies across related work. Our catalog enables researchers and practitioners to address various error types and to systematically implement error-specific detection and cleaning strategies in data quality tools.