Profiling data by plotting distributions and analyzing summary statistics is a critical step throughout data analysis. Currently, this process is manual and tedious because analysts must write extra code to examine their data after every transformation. This inefficiency may lead data scientists to profile their data infrequently, rather than after each transformation, making it easy for them to miss important errors or insights. We propose continuous data profiling as a process that allows analysts to immediately see interactive visual summaries of their data throughout their data analysis to facilitate fast and thorough analysis. Our system, AutoProfiler, supports continuous data profiling in three ways: it automatically displays data distributions and summary statistics to facilitate data comprehension; it is live, so visualizations are always accessible and update automatically as the data changes; and it supports follow-up analysis and documentation by authoring code for the user in the notebook. In a user study with 16 participants, we evaluate two versions of our system that integrate different levels of automation: both automatically show data profiles and facilitate code authoring, but one version updates reactively and the other updates only on demand. We find that both tools facilitate insight discovery, with 91% of user-generated insights originating from the tools rather than from manual profiling code written by users. Participants found live updates intuitive and felt they helped them verify their transformations, while those with on-demand profiles liked the ability to look at past visualizations. We also present a longitudinal case study on how AutoProfiler helped domain scientists find serendipitous insights about their data through automatic, live data profiles. Our results have implications for the design of future tools that offer automated data analysis support.
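To make the manual workflow described above concrete, the following is a minimal sketch of the kind of profiling code an analyst might write and re-run by hand after each transformation. It is not AutoProfiler's implementation; the function name `profile_column`, the `bins` parameter, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import mean, median


def profile_column(values, bins=5):
    """Summarize one numeric column: missing count, basic statistics, histogram counts."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if not present:
        return {"count": 0, "missing": missing}
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    # Assign each value to an equal-width bin; the maximum goes in the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in present)
    return {
        "count": len(present),
        "missing": missing,
        "min": lo,
        "max": hi,
        "mean": mean(present),
        "median": median(present),
        "histogram": [counts.get(i, 0) for i in range(bins)],
    }


# Hypothetical data: the profile must be recomputed by hand after each transformation.
ages = [23, 35, None, 41, 29, 35, 120]
before = profile_column(ages)

cleaned = [a for a in ages if a is None or a < 100]  # drop an implausible outlier
after = profile_column(cleaned)

print(before)
print(after)
```

Comparing `before` and `after` shows the outlier leaving the last histogram bin, but only because the analyst remembered to call `profile_column` again. This repeated step is the one that continuous data profiling aims to make automatic.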