Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, a developer may erroneously train a scale-sensitive model on unscaled data. Developers typically detect these faults only after training their models and manually analyzing the results, which is inefficient. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code rather than after training the model. Our approach combines data-flow and control-flow analysis with API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that it can detect faults that require a data-aware approach.
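To make the unscaled-data example concrete, the following is a minimal toy sketch and not the approach proposed in this work. It assumes straight-line, top-level code and uses two hand-written sets, `SCALE_SENSITIVE` and `SCALERS`, as stand-ins for API contracts. These sets, the helper `_method_call`, and the function `find_unscaled_fits` are hypothetical names introduced only for illustration. The sketch tracks an abstract "scaled" property per variable through the program and does not inspect the data itself.

```python
import ast

# Hypothetical "API contracts" (assumptions for illustration only):
# estimators that expect scaled input, and transformers that produce it.
SCALE_SENSITIVE = {"SVC", "KNeighborsClassifier", "LogisticRegression"}
SCALERS = {"StandardScaler", "MinMaxScaler"}


def _method_call(node):
    """Return (receiver, method, call) for calls shaped like `obj.method(...)`, else None."""
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)):
        return node.func.value.id, node.func.attr, node
    return None


def find_unscaled_fits(source):
    """Return line numbers where a scale-sensitive model is fit on data not known to be scaled."""
    kinds = {}       # variable name -> class name it was constructed from
    scaled = set()   # variables known to hold scaler output
    findings = []
    for stmt in ast.parse(source).body:  # top-level statements, in program order
        # Step 1: check every `model.fit(X, ...)` call against the current state.
        for node in ast.walk(stmt):
            call = _method_call(node)
            if call and call[1] == "fit" and kinds.get(call[0]) in SCALE_SENSITIVE:
                args = call[2].args
                if args and isinstance(args[0], ast.Name) and args[0].id not in scaled:
                    findings.append(node.lineno)
        # Step 2: update the state for simple assignments `name = <call>`.
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            target, value = stmt.targets[0].id, stmt.value
            kinds.pop(target, None)   # reassignment invalidates earlier facts
            scaled.discard(target)
            if isinstance(value, ast.Call) and isinstance(value.func, ast.Name):
                kinds[target] = value.func.id          # e.g. model = SVC()
            call = _method_call(value)
            if (call and kinds.get(call[0]) in SCALERS
                    and call[1] in {"fit_transform", "transform"}):
                scaled.add(target)                     # e.g. X = scaler.transform(X)
    return findings


SAMPLE = """\
scaler = StandardScaler()
model = SVC()
model.fit(X_train, y_train)
X_scaled = scaler.fit_transform(X_train)
model.fit(X_scaled, y_train)
"""

print(find_unscaled_fits(SAMPLE))  # [3]
```

The sample is only parsed, never executed, so scikit-learn does not need to be installed. The sketch ignores branches, loops, function bodies, and aliasing, which a real analysis combining data flow and control flow would have to handle.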