Evaluating Machine Learning Models with NERO: Non-Equivariance Revealed on Orbits

Proper evaluation is crucial for understanding, troubleshooting, and interpreting model behavior, and for further improving model performance. Scalar error metrics give a fast overview of model performance, but they are often too abstract to expose specific weak spots and carry little information about important model properties such as robustness. This not only hinders machine learning models from becoming more interpretable and trusted, but can also mislead both model developers and users. Additionally, conventional evaluation procedures often leave researchers unclear about where and how a model fails, which complicates model comparison and further development. To address these issues, we propose a novel evaluation workflow named Non-Equivariance Revealed on Orbits (NERO) Evaluation. The goal of NERO evaluation is to shift the focus from traditional scalar-based metrics to evaluating and visualizing model equivariance, which closely captures model robustness, and to allow researchers to quickly investigate interesting or unexpected model behaviors. NERO evaluation consists of a task-agnostic interactive interface and a set of visualizations, called NERO plots, which reveal the equivariance properties of the model. Case studies applying NERO evaluation to multiple research areas, including 2D digit recognition, object detection, particle image velocimetry (PIV), and 3D point cloud classification, demonstrate that it can quickly illustrate differences in model equivariance and effectively explain model behaviors through interactive visualizations of the model outputs. In addition, we propose consensus, an alternative to ground truth, for use in NERO evaluation so that model equivariance can still be evaluated on new, unlabeled datasets.
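As a rough illustration of evaluating a model over an orbit instead of through a single scalar, the sketch below applies the four 90-degree rotations to one input and records a per-rotation result for two toy classifiers. Everything in it is an assumption made for illustration: the rotation group, the names `orbit`, `nero_profile`, `consensus`, `top_half_model`, and `total_mass_model`, and in particular the reading of consensus as a majority vote over the orbit. It is not the authors' implementation.

```python
import numpy as np

def orbit(image):
    """Return the four 90-degree rotations of a 2D array (hypothetical choice of group)."""
    return [np.rot90(image, k) for k in range(4)]

def top_half_model(image):
    """Toy classifier that depends on orientation: 1 if the top half is brighter."""
    h = image.shape[0] // 2
    return int(image[:h].sum() > image[h:].sum())

def total_mass_model(image):
    """Toy classifier that ignores orientation: 1 if total intensity exceeds 2."""
    return int(image.sum() > 2)

def nero_profile(model, image, label):
    """Correctness (1.0 or 0.0) at each orbit element, instead of one averaged score."""
    return [float(model(x) == label) for x in orbit(image)]

def consensus(model, image):
    """Majority prediction over the orbit, an assumed stand-in for a missing label."""
    preds = [model(x) for x in orbit(image)]
    values, counts = np.unique(preds, return_counts=True)
    return int(values[np.argmax(counts)])

# A 4x4 image whose top row is bright; its label is taken to be 1.
image = np.zeros((4, 4))
image[0, :] = 1.0

print(nero_profile(top_half_model, image, 1))    # [1.0, 0.0, 0.0, 0.0]
print(nero_profile(total_mass_model, image, 1))  # [1.0, 1.0, 1.0, 1.0]
print(consensus(total_mass_model, image))        # 1
```

Averaged over the orbit, the first model scores 0.25 and the second 1.0, but only the per-rotation profile shows that the first model is correct solely in the original orientation. That per-orbit view is the kind of information a NERO plot is meant to expose.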

