This paper introduces a data analysis method for big data based on a newly defined regression model, multiple model linear regression (MMLR), which separates the input dataset into subsets and constructs a local linear regression model on each subset. The proposed method is shown to be more efficient and flexible than other regression-based methods. The paper also proposes an approximate algorithm for constructing MMLR models based on an $(\epsilon,\delta)$-estimator, and gives mathematical proofs of the algorithm's correctness and efficiency; its time complexity is linear in the size of the input dataset. The method is evaluated empirically on both synthetic and real-world datasets: the algorithm performs comparably to existing regression methods in many cases, while taking nearly the shortest time to reach high prediction accuracy.
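The core idea of splitting the input into subsets and fitting one local linear model per subset can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single input feature and breakpoints supplied by hand, whereas MMLR determines the subsets itself, and it does not reproduce the $(\epsilon,\delta)$-estimator-based approximation. The names `fit_line`, `fit_local_models`, and `predict`, as well as the toy data, are hypothetical.

```python
from bisect import bisect_right

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept on one subset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # A subset with no spread in x falls back to a constant model.
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x

def fit_local_models(xs, ys, breakpoints):
    """Split the data at the sorted breakpoints and fit one line per subset.

    Assumption: the breakpoints are given by the caller; this sketch does
    not learn the partition from the data.
    """
    buckets = [([], []) for _ in range(len(breakpoints) + 1)]
    for x, y in zip(xs, ys):
        bx, by = buckets[bisect_right(breakpoints, x)]
        bx.append(x)
        by.append(y)
    # Empty subsets get no model.
    return [fit_line(bx, by) if bx else None for bx, by in buckets]

def predict(models, breakpoints, x):
    """Predict with the local model of the subset that x falls into."""
    model = models[bisect_right(breakpoints, x)]
    if model is None:
        raise ValueError("no training data in the subset containing x")
    slope, intercept = model
    return slope * x + intercept

# Hypothetical toy data: y = 2x for x < 5 and y = -x + 20 for x >= 5.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2 * x if x < 5 else -x + 20 for x in xs]
breakpoints = [5]
models = fit_local_models(xs, ys, breakpoints)
```

A single global line would fit this toy data poorly, while the two local models recover each regime exactly. Each point is visited a constant number of times during fitting, so for a fixed partition the sketch runs in time linear in the number of points, apart from the logarithmic bucket lookup.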