Tree kernels have been proposed for use in many areas, such as the automatic learning of natural language applications. In this paper, we propose a new linear-time algorithm for SubTree kernel computation based on the concept of weighted tree automata. First, we introduce a new class of weighted tree automata, called Root-Weighted Tree Automata, and their associated formal tree series. Then, from this class, we define the SubTree automata, which are compact computational models for finite tree languages. This allows us to design a theoretically guaranteed linear-time algorithm for computing the SubTree kernel based on weighted tree automata intersection. The key idea behind the proposed algorithm is to replace the DAG reduction and node sorting steps used in previous approaches with the computation of state equivalence classes, which the weighted tree automata approach makes possible. Our approach has three major advantages: it is output-sensitive, it is insensitive to the tree type (ordered versus unordered trees), and it is well suited to any incremental tree-kernel-based learning method. Finally, we conduct a variety of comparative experiments on a wide range of synthetic tree language datasets designed for in-depth analysis of the algorithm. The results show that the proposed algorithm outperforms state-of-the-art methods.
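To make the quantity being computed concrete, the following is a minimal sketch of an unweighted SubTree kernel for two ordered trees. It is not the Root-Weighted Tree Automata algorithm described above. It only illustrates the general idea of giving identical complete subtrees the same class identifier and then matching counts. The tree encoding (a `(label, children)` pair) and the names `subtree_counts` and `subtree_kernel` are assumptions made for this illustration.

```python
from collections import Counter

# Assumed encoding: a tree is a (label, children) pair,
# where children is a tuple of trees in left-to-right order.

def subtree_counts(tree, table, counts):
    """Give each complete subtree a class id and count its occurrences.

    `table` maps (label, child class ids) to a class id. It is shared
    between both trees, so identical subtrees receive identical ids.
    """
    label, children = tree
    key = (label, tuple(subtree_counts(c, table, counts) for c in children))
    class_id = table.setdefault(key, len(table))
    counts[class_id] += 1
    return class_id

def subtree_kernel(t1, t2):
    """Count pairs of identical complete subtrees, one from each tree."""
    table = {}
    c1, c2 = Counter(), Counter()
    subtree_counts(t1, table, c1)
    subtree_counts(t2, table, c2)
    return sum(n * c2[i] for i, n in c1.items() if i in c2)

t1 = ("a", (("b", ()), ("b", ())))
t2 = ("a", (("b", ()), ("c", ())))
print(subtree_kernel(t1, t2))
```

In this example the only shared complete subtree is the leaf `b`, which occurs twice in `t1` and once in `t2`. The sketch is recursive, so very deep trees would need an iterative traversal, and it treats children as ordered.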