Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error's nature, scope, and impact on users. We built on this insight in a case study with machine translation models and developed Angler, an interactive visual analytics tool to help practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite and obtaining reliable signals of model quality is expensive. Our study revealed that participants could form more interesting and user-focused hypotheses for prioritization by combining quantitative summary statistics with qualitative assessment of the data through reading sentences.