Production machine learning (ML) systems fail silently -- not with crashes, but through wrong decisions. While observability is recognized as critical for ML operations, there is little empirical evidence of what practitioners actually capture. This study presents empirical results on ML observability in practice, drawn from seven focus group sessions across several domains. We catalog the information practitioners systematically capture about ML systems and their environment and map how they use it to validate models, detect and diagnose faults, and explain observed degradations. Finally, we identify gaps in current practice and outline implications for tooling design and research to establish ML observability practices.