We present convincing empirical evidence for an effective and general strategy for building accurate small models. Such models are attractive for interpretability and are also useful in resource-constrained environments. The strategy is to learn the training distribution instead of using data from the test distribution. The distribution learning algorithm is not a contribution of this work; rather, we highlight the broad usefulness of this simple strategy on a diverse set of tasks, and these rigorous empirical results are our contribution. We apply it to the tasks of (1) building cluster explanation trees, (2) prototype-based classification, and (3) classification using Random Forests, and show that it improves the accuracy of weak traditional baselines to the point that they are surprisingly competitive with specialized modern techniques. The strategy is also versatile with respect to the notion of model size. In the first two tasks, model size is measured by the number of leaves in the tree and the number of prototypes, respectively. In the final task, involving Random Forests, the strategy is shown to be effective even when model size is determined by more than one factor: the number of trees and their maximum depth. We present positive results on multiple datasets and show that they are statistically significant. These results lead us to conclude that the strategy is both effective, i.e., leads to significant improvements, and general, i.e., is applicable to different tasks and model families, and therefore merits further attention in domains that require small accurate models.