The Human Factor in Data Cleaning: Exploring Preferences and Biases

Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks, surface similarity produces a substantial false-positive rate with high confidence. In data repair, participants show a robust preference for leaving values missing rather than imputing plausible values, consistent with omission bias. In contrast, automation-aligned switching under strong contradiction does not exceed a conservative rare-error tolerance threshold at the population level, indicating that deference to automated recommendations is limited in this setting. Across scenarios, bias patterns persist among technically experienced participants and across diverse workflow practices, suggesting that bias in data cleaning reflects general cognitive tendencies rather than lack of expertise. These findings motivate human-in-the-loop cleaning systems that clearly separate representation from semantics, present expert or algorithmic recommendations non-prescriptively, and support reflective evaluation of atypical but valid cases.
