Continual Learning with Dynamic Sparse Training: Exploring Algorithms for Effective Model Updates

Continual learning (CL) refers to the ability of an intelligent system to sequentially acquire and retain knowledge from a stream of data with as little computational overhead as possible. To this end, regularization, replay, architecture, and parameter isolation approaches have been introduced in the literature. Parameter isolation with a sparse network makes it possible to allocate distinct parts of the neural network to different tasks, while still allowing parameters to be shared between tasks that are similar. Dynamic Sparse Training (DST) is a prominent way to find these sparse networks and isolate them for each task. This paper is the first empirical study of how different DST components behave under the CL paradigm; it aims to fill a critical research gap and shed light on the optimal configuration of DST for CL, if one exists. We therefore perform a comprehensive study in which we investigate various DST components to find the best topology per task on the well-known CIFAR100 and miniImageNet benchmarks. We use a task-incremental CL setup, since our primary focus is to evaluate the performance of various DST criteria rather than the process of mask selection. We found that, at a low sparsity level, Erdős-Rényi Kernel (ERK) initialization utilizes the backbone more efficiently and allows increments of tasks to be learned effectively. At a high sparsity level, unless it is extreme, uniform initialization demonstrates more reliable and robust performance. In terms of growth strategy, performance depends on the chosen initialization strategy and the extent of sparsity. Finally, adaptivity within DST components is a promising way toward better continual learners.
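To make the contrast between the two initialization schemes concrete, the sketch below computes per-layer densities under uniform and ERK allocation for one global density budget. It follows the ERK rule as it is commonly stated in the DST literature, where a layer's density is proportional to the sum of its dimensions divided by their product. It is not taken from the paper's code: the function names and the toy layer shapes are assumptions chosen for illustration.

```python
import numpy as np

def uniform_densities(shapes, density):
    """Uniform initialization: every layer gets the same density."""
    return [density for _ in shapes]

def erk_densities(shapes, density):
    """ERK initialization: layer density is proportional to
    sum(dims) / prod(dims), rescaled so the total number of nonzero
    weights matches the global budget. Layers whose density would
    exceed 1 are kept fully dense and the rest are rescaled."""
    assert 0.0 < density < 1.0, "global density must be in (0, 1)"
    n_params = [int(np.prod(s)) for s in shapes]
    raw = [sum(s) / np.prod(s) for s in shapes]  # unscaled ERK score
    dense = set()  # indices of layers forced to density 1.0
    while True:
        # Nonzero budget left for the layers that are not forced dense.
        budget = density * sum(n_params) - sum(n_params[i] for i in dense)
        divisor = sum(raw[i] * n_params[i]
                      for i in range(len(shapes)) if i not in dense)
        eps = budget / divisor
        over = [i for i in range(len(shapes))
                if i not in dense and eps * raw[i] > 1.0]
        if not over:
            break
        dense.update(over)
    return [1.0 if i in dense else float(eps * raw[i])
            for i in range(len(shapes))]

# Hypothetical toy backbone: two conv layers (out, in, kh, kw) and one
# linear layer (out, in).
shapes = [(64, 3, 3, 3), (128, 64, 3, 3), (100, 128)]
print("uniform:", uniform_densities(shapes, 0.1))
print("ERK:    ", [round(d, 4) for d in erk_densities(shapes, 0.1)])
```

Under ERK, small layers such as the first convolution receive a much higher density than large ones, while uniform allocation prunes every layer equally. This is one plausible reading of why the two schemes behave differently as the sparsity level changes.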

