Machine Learning (ML) is increasingly used to automate impactful decisions, which raises concerns about the correctness, reliability, and fairness of these decisions. We envision highly automated software platforms that assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code that relies on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite ML pipelines to enable diverse use cases, without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation, together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses, and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
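As a minimal illustrative sketch of the core idea, the following toy example models a pipeline as a chain of logical operators and shows row-level provenance tracking through a selection step. All names here (`LogicalOp`, `run_with_provenance`) are hypothetical and are not the actual system's API; a real implementation would extract such plans automatically from pandas/scikit-learn code.

```python
# Hypothetical sketch: an ML pipeline as a chain of logical operators,
# with lightweight row-level provenance tracking. Names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class LogicalOp:
    """One node of a logical query plan extracted from pipeline code."""
    name: str                               # e.g. "DataSource", "Selection"
    parents: list = field(default_factory=list)

    def __rshift__(self, child):            # chain operators: parent >> child
        child.parents.append(self)
        return child


def run_with_provenance(rows, predicate):
    """A selection that tags each surviving row with its source row index,
    so downstream issues can be traced back to the input data."""
    return [(i, row) for i, row in enumerate(rows) if predicate(row)]


# Build a toy plan: DataSource -> Selection -> Projection
plan = LogicalOp("DataSource") >> LogicalOp("Selection") >> LogicalOp("Projection")

rows = [{"age": 17}, {"age": 42}, {"age": 35}]
kept = run_with_provenance(rows, lambda r: r["age"] >= 18)
# Each kept row carries its provenance id (original index), enabling
# screening checks such as "which input rows were filtered out?".
```

In this sketch the provenance ids make it trivial to diagnose a common data preparation issue: comparing the kept ids against the full index reveals exactly which rows a filter dropped.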