The capabilities and adoption of deep neural networks (DNNs) are growing at an exhilarating pace: vision models accurately classify human actions in videos and identify cancerous tissue in medical scans as precisely as human experts; large language models answer wide-ranging questions, generate code, and write prose, becoming the topic of everyday dinner-table conversations. Exciting as these uses are, the continually increasing model sizes and computational complexities have a dark side. The economic cost and negative environmental externalities of training and serving models are in evident disharmony with financial viability and climate action goals. Instead of pursuing yet another increase in predictive performance, this dissertation is dedicated to the improvement of neural network efficiency. Specifically, a core contribution addresses efficiency during online inference. Here, the concept of Continual Inference Networks (CINs) is proposed and explored across four publications. CINs extend prior state-of-the-art methods developed for offline processing of spatio-temporal data and reuse their pre-trained weights, improving their online processing efficiency by an order of magnitude. These advances are attained through a bottom-up computational reorganization and judicious architectural modifications. The benefit to online inference is demonstrated by reformulating several widely used network architectures into CINs, including 3D CNNs, ST-GCNs, and Transformer Encoders. An orthogonal contribution tackles the concurrent adaptation and computational acceleration of a large source model into multiple lightweight derived models. Drawing on fusible adapter networks and structured pruning, Structured Pruning Adapters achieve superior predictive accuracy under aggressive pruning while using significantly fewer learned weights than fine-tuning with pruning.
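The idea of reorganizing an offline operation for online use while keeping its weights can be illustrated with a deliberately simplified sketch. The example below is not the dissertation's implementation: it assumes a one-dimensional temporal convolution over scalar frames, and the names `offline_conv` and `ContinualConv` are hypothetical. It only shows that a step-wise layer holding a small buffer can reproduce the offline outputs with the same weights, computing one new output per incoming frame instead of re-running the whole clip.

```python
from collections import deque


def offline_conv(frames, weights):
    """Offline temporal convolution: needs the whole clip up front."""
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, frames[t:t + k]))
        for t in range(len(frames) - k + 1)
    ]


class ContinualConv:
    """Hypothetical step-wise counterpart that reuses the same weights."""

    def __init__(self, weights):
        self.weights = list(weights)
        # Assumption: caching the last k input frames is enough state here.
        self.buffer = deque(maxlen=len(self.weights))

    def step(self, frame):
        """Consume one frame; return one output once the buffer is full."""
        self.buffer.append(frame)
        if len(self.buffer) < len(self.weights):
            return None  # still warming up: too few frames seen so far
        return sum(w * x for w, x in zip(self.weights, self.buffer))


weights = [0.5, -1.0, 2.0]           # stands in for pre-trained weights
stream = [1.0, 2.0, 3.0, 4.0, 5.0]   # frames arriving one at a time

conv = ContinualConv(weights)
online = [y for y in map(conv.step, stream) if y is not None]
offline = offline_conv(stream, weights)
```

Under these assumptions, `online` and `offline` hold the same values, which is the property that allows pre-trained weights to be reused unchanged.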