Data Selection: A Surprisingly Effective and General Principle for Building Small Interpretable Models

We present convincing empirical evidence for an effective and general strategy for building accurate small models. Such models are attractive for interpretability and also find use in resource-constrained environments. The strategy is to learn the training distribution rather than train on data drawn from the test distribution. The distribution learning algorithm is not a contribution of this work; we highlight the broad usefulness of this simple strategy on a diverse set of tasks, and as such these rigorous empirical results are our contribution. We apply it to the tasks of (1) building cluster explanation trees, (2) prototype-based classification, and (3) classification using Random Forests, and show that it improves the accuracy of weak traditional baselines to the point that they are surprisingly competitive with specialized modern techniques. The strategy is also versatile with respect to the notion of model size. In the first two tasks, model size is measured by the number of leaves in the tree and the number of prototypes, respectively. In the final task, involving Random Forests, the strategy is shown to be effective even when model size is determined by more than one factor: the number of trees and their maximum depth. We present positive results on multiple datasets and show that they are statistically significant. These lead us to conclude that this strategy is both effective, i.e., leads to significant improvements, and general, i.e., is applicable to different tasks and model families, and therefore merits further attention in domains that require small accurate models.
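The abstract does not say how the training distribution is learned, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a one-dimensional binary classification problem, uses a two-leaf decision stump as the "small model", and searches at random over a hypothetical family of sampling distributions (weights that decay exponentially with distance from a `center`). All function and parameter names here are illustrative assumptions.

```python
import math
import random


def predict(stump, x):
    """Predict a 0/1 label with a two-leaf tree given as (threshold, left_label)."""
    threshold, left_label = stump
    return left_label if x <= threshold else 1 - left_label


def accuracy(stump, points):
    """Fraction of (x, y) pairs that the stump labels correctly."""
    return sum(predict(stump, x) == y for x, y in points) / len(points)


def fit_stump(points):
    """Exhaustively pick the split and leaf labelling with the best accuracy on `points`."""
    xs = sorted({x for x, _ in points})
    # Candidate thresholds are midpoints between consecutive distinct x values.
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    candidates = [(t, left) for t in thresholds for left in (0, 1)]
    return max(candidates, key=lambda stump: accuracy(stump, points))


def make_data(n, rng):
    """Synthetic data (an assumption for this demo): label is x > 0.6 with 20% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.6)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data


def select_data_and_fit(train, val, n_trials, sample_size, rng):
    """Random search over sampling distributions; keep the stump that validates best.

    The baseline is a stump trained on all of `train` (the original distribution),
    so the returned accuracy is never below the baseline on `val`.
    """
    best_stump = fit_stump(train)
    base_acc = best_acc = accuracy(best_stump, val)
    for _ in range(n_trials):
        # Hypothetical distribution family: weight decays with distance from `center`.
        center = rng.random()
        scale = rng.uniform(0.05, 1.0)
        weights = [math.exp(-abs(x - center) / scale) for x, _ in train]
        sample = rng.choices(train, weights=weights, k=sample_size)
        stump = fit_stump(sample)
        acc = accuracy(stump, val)
        if acc > best_acc:
            best_stump, best_acc = stump, acc
    return base_acc, best_acc, best_stump


if __name__ == "__main__":
    rng = random.Random(0)
    train, val = make_data(200, rng), make_data(200, rng)
    base_acc, best_acc, stump = select_data_and_fit(train, val, 20, 100, rng)
    print("baseline:", base_acc, "selected:", best_acc, "stump:", stump)
```

Because the sampling distribution is chosen using validation accuracy, any reported comparison against the baseline should use a separate held-out test set. A real application would replace the random search and the stump with the distribution learner and the size-constrained model of interest.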

