Supervised learning algorithms generally assume that enough memory is available to store their data model during the training and test phases. In the Internet of Things, however, this assumption is unrealistic when data arrives as infinite data streams, or when learning algorithms are deployed on devices with limited memory. In this paper, we adapt the online Mondrian forest classification algorithm to work with memory constraints on data streams. In particular, we design five out-of-memory strategies to update Mondrian trees with new data points once the memory limit is reached. Moreover, we design trimming mechanisms to make Mondrian trees more robust to concept drift under memory constraints. We evaluate our algorithms on a variety of real and simulated datasets, and we conclude with recommendations on their use in different situations: the Extend Node strategy appears to be the best out-of-memory strategy in all configurations, whereas the choice of trimming mechanism should depend on whether a concept drift is expected. All our methods are implemented in the OrpailleCC open-source library and are ready to be used on embedded systems and connected objects.