Multivariate temporal (or time) series classification is, in a sense, the temporal generalization of (numeric) classification: every instance is described by multiple time series instead of multiple values. Symbolic classification is the machine learning strategy for extracting explicit knowledge from a data set, and the symbolic classification of multivariate time series requires the design, implementation, and testing of ad hoc machine learning algorithms, such as algorithms for extracting temporal versions of decision trees. One of the best-known algorithms for decision tree extraction from categorical data is Quinlan's ID3, which was later extended to deal with numerical attributes, resulting in the algorithm known as C4.5. C4.5 is implemented in many open-source data mining libraries, including Weka, whose implementation of C4.5 is called J48. ID3 was recently generalized to deal with temporal data in the form of timelines, which can be seen as discrete (categorical) versions of multivariate time series; this generalization, based on the interval temporal logic HS, is known as Temporal ID3. In this paper we introduce Temporal C4.5, which allows the extraction of temporal decision trees from undiscretized multivariate time series, describe its implementation, called Temporal J48, and discuss the outcome of a set of experiments with the latter on a collection of public data sets, comparing the results with those obtained by other, classical, multivariate time series classification methods.
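To illustrate the step from ID3 to C4.5 mentioned above, the following is a minimal sketch of how a decision tree learner might choose a split threshold on one numerical attribute. It is an assumption-laden simplification, not the algorithm of this paper nor Weka's J48: it scores candidates with plain information gain (C4.5 also uses the gain ratio), takes midpoints between consecutive distinct values as candidate thresholds, and the names `entropy` and `best_threshold` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split
    `value <= threshold` on a single numerical attribute.

    Returns (None, 0.0) when no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint candidate (a simplifying assumption)
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weighted average entropy of the two sides of the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # (2.5, 1.0)
```

In the toy call, the two classes are perfectly separated at 2.5, so the split recovers the full one bit of entropy. A temporal variant would have to evaluate such comparisons over intervals of a time series rather than over single values; that part is not sketched here.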