Conventional rule learning algorithms aim to find a set of simple rules in which each rule covers as many examples as possible. In this paper, we argue that rules found in this way may not be the optimal explanation for each of the examples they cover. Instead, we propose an efficient algorithm that searches for the best rule covering each training example, using a greedy optimization that consists of one specialization loop and one generalization loop. These locally optimal rules are collected and then filtered to form the final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover it. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and inherently parallelizable: processing it in parallel does not change the learned rule set and therefore does not change the classification accuracy. We thus believe that it closes an important gap in large-scale classification rule induction.
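The per-example search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the rule quality measure, the filtering criterion, or the data representation, so the sketch assumes categorical features, a Laplace-corrected precision as the heuristic, plain deduplication as the filtering step, and a majority-class fallback when no rule covers an example. All function names (`covers`, `laplace`, `best_rule_for`, `learn`, `classify`) are hypothetical.

```python
from collections import Counter

def covers(body, x):
    """A rule body is a frozenset of (feature_index, value) conditions."""
    return all(x[f] == v for f, v in body)

def laplace(body, label, X, y):
    """Laplace-corrected precision of `body -> label` (assumed heuristic)."""
    covered = [yi for xi, yi in zip(X, y) if covers(body, xi)]
    positives = sum(1 for yi in covered if yi == label)
    return (positives + 1) / (len(covered) + 2)

def best_rule_for(x, label, X, y):
    """Greedy search for a good rule covering the training example (x, label)."""
    body = frozenset()
    score = laplace(body, label, X, y)
    # Specialization loop: add the condition taken from x that helps most,
    # and stop as soon as no added condition strictly improves the score.
    while True:
        candidates = [body | {(f, v)} for f, v in enumerate(x) if (f, v) not in body]
        if not candidates:
            break
        scored = [(laplace(c, label, X, y), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= score:
            break
        body, score = best, best_score
    # Generalization loop: drop conditions while the score does not decrease.
    while body:
        candidates = [body - {c} for c in sorted(body)]
        scored = [(laplace(c, label, X, y), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score < score:
            break
        body, score = best, best_score
    return body, label, score

def learn(X, y):
    """Collect one rule per training example; filtering here only deduplicates."""
    rules = {}
    for x, label in zip(X, y):
        body, head, score = best_rule_for(x, label, X, y)
        rules[(body, head)] = score
    default = Counter(y).most_common(1)[0][0]  # assumed fallback class
    return rules, default

def classify(x, rules, default):
    """Predict with the best-scoring rule among those that cover x."""
    covering = [(s, head) for (body, head), s in rules.items() if covers(body, x)]
    if not covering:
        return default
    return max(covering, key=lambda t: t[0])[1]

# Toy data: the class depends only on the first feature.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
y = ["no", "no", "yes", "yes"]
rules, default = learn(X, y)
print(classify(("rainy", "hot"), rules, default))
```

Because `best_rule_for` reads only the training data and writes nothing shared, the loop in `learn` could be distributed over workers without changing the collected rules, which mirrors the parallelism claim in the abstract.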