The Human Factor in Data Cleaning: Exploring Preferences and Biases

Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks, surface similarity produces a substantial false-positive rate with high confidence. In data repair, participants show a robust preference for leaving values missing rather than imputing plausible values, consistent with omission bias. In contrast, automation-aligned switching under strong contradiction does not exceed a conservative rare-error tolerance threshold at the population level, indicating that deference to automated recommendations is limited in this setting. Across scenarios, bias patterns persist among technically experienced participants and across diverse workflow practices, suggesting that bias in data cleaning reflects general cognitive tendencies rather than lack of expertise. These findings motivate human-in-the-loop cleaning systems that clearly separate representation from semantics, present expert or algorithmic recommendations non-prescriptively, and support reflective evaluation of atypical but valid cases.
