The past decade has seen a large body of work that uses visualization (VIS) to interpret machine learning (ML) models. The corresponding research topic, VIS4ML, continues to grow at a fast pace. To better organize this body of work and shed light on the developing trends of VIS4ML, we provide a systematic review of it in this survey. Since data quality greatly impacts the performance of ML models, our survey focuses specifically on summarizing VIS4ML works from the data perspective. First, we categorize the common data handled by ML models into five types, explain the unique features of each type, and highlight the ML models that are good at learning from each. Second, from the large number of VIS4ML works, we tease out six tasks that operate on these types of data (i.e., data-centric tasks) at different stages of the ML pipeline to understand, diagnose, and refine ML models. Lastly, by studying the distribution of 143 surveyed papers across the five data types, six data-centric tasks, and their intersections, we analyze prospective research directions and envision future research trends.