Machine learning models with high accuracy on test data can still produce systematic failures, such as harmful biases and safety issues, when deployed in the real world. To detect and mitigate such failures, practitioners run behavioral evaluation of their models, checking model outputs for specific types of inputs. Behavioral evaluation is important but challenging, requiring that practitioners discover real-world patterns and validate systematic failures. We conducted 18 semi-structured interviews with ML practitioners to better understand the challenges of behavioral evaluation and found that it is a collaborative, use-case-first process that is not adequately supported by existing task- and domain-specific tools. Using these findings, we designed Zeno, a general-purpose framework for visualizing and testing AI systems across diverse use cases. In four case studies with participants using Zeno on real-world models, we found that practitioners were able to reproduce previous manual analyses and discover new systematic failures.
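To make the idea of behavioral evaluation concrete, the following is a minimal sketch of checking model outputs on specific types of inputs by computing accuracy per input slice. It does not use Zeno's API; the function `slice_accuracy`, the slice names, and the toy sentiment data are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]


def slice_accuracy(
    examples: List[Example],
    slices: Dict[str, Callable[[Example], bool]],
) -> Dict[str, float]:
    """Return accuracy per named slice; slices with no matching examples are skipped."""
    report: Dict[str, float] = {}
    for name, belongs in slices.items():
        members = [ex for ex in examples if belongs(ex)]
        if not members:
            continue
        correct = sum(ex["label"] == ex["prediction"] for ex in members)
        report[name] = correct / len(members)
    return report


# Hypothetical toy data: labels and predictions from an assumed sentiment model.
examples = [
    {"text": "great movie", "label": "pos", "prediction": "pos"},
    {"text": "not good at all", "label": "neg", "prediction": "pos"},
    {"text": "terrible plot", "label": "neg", "prediction": "neg"},
    {"text": "not bad", "label": "pos", "prediction": "neg"},
]

# Each slice is a predicate over one example (an assumption of this sketch).
slices = {
    "all": lambda ex: True,
    "contains_negation": lambda ex: "not" in str(ex["text"]).split(),
}

report = slice_accuracy(examples, slices)
for name, acc in report.items():
    print(f"{name}: {acc:.2f}")
```

In this toy setup, aggregate accuracy can hide a slice on which the model fails consistently, which is the kind of systematic failure behavioral evaluation is meant to surface.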