Towards Scalable Visual Data Wrangling via Direct Manipulation

Data wrangling, the process of cleaning, transforming, and preparing data for analysis, is a well-known bottleneck in data science workflows. A wide range of data wrangling techniques has been proposed to mitigate this challenge. Of particular interest are visual data wrangling tools, in which users prepare data through graphical interactions (such as with visualizations) rather than by writing scripts. We develop a visual data wrangling system, Buckaroo, that expands upon this paradigm by automatically discovering interesting groups (e.g., Salary values for Country="Bhutan") and identifying anomalies (e.g., missing values, outliers, and type mismatches) both within and across these groups. Crucially, this allows users to reason about how repairs applied to one group affect other groups in the dataset. A central challenge in visual data wrangling is scalability: rendering entire datasets is often infeasible, yet showing only a small sample risks hiding rare but critical errors across groups. We address this challenge through carefully designed sampling strategies that prioritize errors, as well as novel aggregation techniques that support pan-and-zoom interactions over large datasets. Buckaroo maintains efficient indexing data structures and differential storage to localize anomaly detection and minimize recomputation. We demonstrate the applicability of our approach through an integration with the Hopara pan-and-zoom engine, which enables multi-layered navigation over large datasets without sacrificing interactivity. Finally, we evaluate our system's usability (via an expert review) and its scalability, finding that the design appears well matched to the challenges of this domain.

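To make the ideas of group-wise anomaly detection and error-prioritizing sampling concrete, the following is a minimal sketch in Python. It is not Buckaroo's implementation: the function names `detect_anomalies` and `error_first_sample`, the toy records, and the use of a modified z-score (median and median absolute deviation, with a cutoff of 3.5) as the outlier test are all assumptions made for illustration. The sketch flags missing values, type mismatches, and outliers within each group, then fills a fixed rendering budget with flagged rows first and random clean rows second.

```python
import random
import statistics
from collections import defaultdict


def detect_anomalies(rows, group_key, value_key, z_cutoff=3.5):
    """Return {row_index: label} using checks computed per group (hypothetical helper)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row.get(group_key)].append(i)

    anomalies = {}
    for indices in groups.values():
        numeric = []
        for i in indices:
            value = rows[i].get(value_key)
            if value is None:
                anomalies[i] = "missing"
            elif isinstance(value, bool) or not isinstance(value, (int, float)):
                anomalies[i] = "type_mismatch"
            else:
                numeric.append(i)
        if len(numeric) < 3:
            continue  # too few values in this group to judge outliers
        values = [rows[i][value_key] for i in numeric]
        median = statistics.median(values)
        mad = statistics.median([abs(v - median) for v in values])
        if mad == 0:
            continue  # no spread in this group; skip the outlier test
        for i in numeric:
            # Modified z-score: an assumed outlier test, not the paper's.
            score = 0.6745 * abs(rows[i][value_key] - median) / mad
            if score > z_cutoff:
                anomalies[i] = "outlier"
    return anomalies


def error_first_sample(rows, anomalies, budget, seed=0):
    """Pick up to `budget` row indices, taking flagged rows before clean ones."""
    rng = random.Random(seed)
    flagged = sorted(anomalies)
    if len(flagged) >= budget:
        return sorted(rng.sample(flagged, budget))
    clean = [i for i in range(len(rows)) if i not in anomalies]
    filler = rng.sample(clean, min(budget - len(flagged), len(clean)))
    return sorted(flagged + filler)


# Toy records (invented for this example).
rows = [
    {"Country": "Bhutan", "Salary": 410},
    {"Country": "Bhutan", "Salary": 395},
    {"Country": "Bhutan", "Salary": 420},
    {"Country": "Bhutan", "Salary": 405},
    {"Country": "Bhutan", "Salary": 9000},   # outlier within its group
    {"Country": "Bhutan", "Salary": None},   # missing value
    {"Country": "Nepal", "Salary": 300},
    {"Country": "Nepal", "Salary": 310},
    {"Country": "Nepal", "Salary": "n/a"},   # type mismatch
    {"Country": "Nepal", "Salary": 305},
    {"Country": "Nepal", "Salary": 295},
]

anomalies = detect_anomalies(rows, "Country", "Salary")
sample = error_first_sample(rows, anomalies, budget=5)
print(anomalies)
print(sample)
```

With a budget of five rows, all three flagged rows are guaranteed a place in the sample, whereas a uniform random sample of the same size could miss any of them. A real system would also need to bound the number of flagged rows per group and update the flags incrementally after each repair, which this sketch does not attempt.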