We present our vision for an automated tool that translates visual properties observed in Machine Learning (ML) visualisations into Python assertions. The tool aims to streamline the manual verification of these visualisations in the ML development cycle, which is critical because real-world data and assumptions often change post-deployment. In a prior study, we mined $54,070$ Jupyter notebooks from GitHub and created a catalogue of $269$ semantically related visualisation-assertion (VA) pairs. Building on this catalogue, we propose a taxonomy that organises the VA pairs by ML verification task. The input feature space comprises a rich source of information mined from the Jupyter notebooks: visualisations, Python source code, and associated markdown text. We will compare the effectiveness of various AI models, including traditional NLP4Code models and modern Large Language Models, using established machine translation metrics, and evaluate them through a qualitative study with human participants. We also plan to address the challenge of extending the existing VA pair dataset with additional pairs from Kaggle, and to compare the tool's effectiveness with commercial generative AI models such as ChatGPT. This research not only contributes to the field of ML system validation but also explores novel ways to leverage AI for automating and enhancing software engineering practices in ML.
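To make the notion of a VA pair concrete, the following is a minimal sketch of what one such pair might look like. It is not taken from the catalogue: the data, the `class_imbalance_ratio` helper, and the 1.5 threshold are all hypothetical, chosen only to illustrate how a property read off a plot can be restated as an assertion.

```python
from collections import Counter

# Hypothetical data: class labels in a training set (not from the catalogue).
labels = ["cat", "dog", "cat", "dog", "cat", "dog", "dog", "cat"]
counts = Counter(labels)

# Visualisation side of the pair: a bar chart of class counts, for example
#   plt.bar(list(counts.keys()), list(counts.values()))
# A reader would inspect the bar heights and conclude the classes look balanced.


def class_imbalance_ratio(class_counts):
    """Return the ratio of the largest class count to the smallest."""
    return max(class_counts.values()) / min(class_counts.values())


# Assertion side of the pair: the same visual property, checked automatically.
# The 1.5 threshold is an illustrative assumption, not a recommended value.
assert class_imbalance_ratio(counts) <= 1.5, "classes look imbalanced"
```

Unlike the chart, the assertion can be re-run whenever the data changes, which is the kind of post-deployment check the envisioned tool is meant to generate.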