With the ongoing integration of machine learning models into everyday life, e.g. in the form of the Internet of Things (IoT), the evaluation of learned models is becoming an increasingly important issue. Tree ensembles are among the best black-box classifiers available and routinely outperform more complex classifiers. While the fast application of tree ensembles has already been studied in the literature for Intel CPUs, it has not yet been studied in the context of ARM CPUs, which are more dominant in IoT applications. In this paper, we first convert the popular QuickScorer algorithm and its siblings from Intel's AVX to ARM's NEON instruction set. Second, we extend our implementation from ranking models to classification models such as Random Forests. Third, we investigate the effects of using fixed-point quantization in Random Forests. Our study shows that a careful implementation of tree traversal on ARM CPUs leads to a speed-up of up to a factor of 9.4 compared to a reference implementation. Moreover, quantized models are faster than their floating-point counterparts in almost all cases, with a negligible impact on the predictive performance of the model. Finally, our study highlights architectural differences between ARM and Intel CPUs, and between different ARM devices, which imply that the best implementation depends on both the specific forest and the specific device used for deployment.
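To make the idea of fixed-point quantization concrete, the following minimal sketch shows how a single split comparison could be evaluated on integers instead of floats. It is an illustration under stated assumptions, not the scheme used in the paper: the 16-bit width, the scale factor `SCALE`, and the helper names `to_fixed`, `split_float`, and `split_fixed` are all hypothetical.

```python
# Hypothetical sketch of fixed-point quantization for one tree split.
# The bit width, SCALE, and all function names are illustrative
# assumptions and are not taken from the paper.

SCALE = 1 << 8  # assumed: 8 fractional bits
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def to_fixed(value: float, scale: int = SCALE) -> int:
    """Map a float to a 16-bit fixed-point integer, saturating on overflow."""
    q = int(round(value * scale))
    return max(INT16_MIN, min(INT16_MAX, q))


def split_float(x: float, threshold: float) -> bool:
    """Reference split test on floating-point values."""
    return x <= threshold


def split_fixed(x: float, threshold: float) -> bool:
    """The same split test after quantizing both operands."""
    return to_fixed(x) <= to_fixed(threshold)


if __name__ == "__main__":
    # The two tests agree unless x lies just above the threshold,
    # closer to it than the quantization step.
    for x in (0.5, 0.7501, 1.0):
        print(x, split_float(x, 0.75), split_fixed(x, 0.75))
```

Because `to_fixed` is monotone, a feature value that passes the floating-point test also passes the quantized one. The two can only disagree when the value lies just above the threshold and both round to the same integer, which is one plausible reason the impact on predictive performance can stay small. On the hardware side, a 128-bit NEON register holds eight 16-bit integers but only four 32-bit floats, so narrower operands allow more comparisons per instruction.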