Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification, focusing on fairness across skin tones as a responsibility constraint. Evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently underperform manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results suggest that further research is needed on redesigning MLE agents so that humans can guide the search process and reliably assess the compliance and quality of the generated ML pipelines.