Edge devices operate under constrained and varying resource budgets, requiring dynamic architectures that can adapt to the resources available. To meet such demands, layer dropping ($\mathcal{LD}$) is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods substantially degrade the dynamic model's performance at both low and high dropping rates, worsening the performance-computation trade-off. To address this, we propose a distillation-based layer dropping (DLD) framework that effectively combines knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, achieving state-of-the-art performance for dynamic speech networks. Comprehensive experiments with well-known speech recognition models, including Conformer and WavLM, on three public benchmarks demonstrate the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ in the high and no dropping cases, respectively, with a $33.3\%$ reduction in training time.
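As a concrete illustration (not the authors' implementation), the following PyTorch sketch shows one way layer dropping and knowledge distillation can be combined in a single end-to-end training step: a full-depth forward pass serves as the teacher for a layer-dropped student pass of the same network. The names `DroppableEncoder` and `dld_step` and the hyperparameters `p_drop`, `alpha`, and `T` are hypothetical assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DroppableEncoder(nn.Module):
    """Layer stack in which each layer can be skipped via an identity shortcut."""

    def __init__(self, num_layers=12, dim=256, vocab=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.head = nn.Linear(dim, vocab)  # frame-level output projection

    def forward(self, x, p_drop=0.0):
        for layer in self.layers:
            # Randomly skip layers during training; at inference, a fixed
            # subset of layers would be dropped to meet the device budget.
            if p_drop > 0 and torch.rand(()) < p_drop:
                continue
            x = layer(x)
        return self.head(x)

def dld_step(model, x, targets, task_loss, p_drop=0.5, alpha=0.5, T=2.0):
    """One training step of distillation-based layer dropping (illustrative).

    The full-depth pass acts as the teacher; the layer-dropped pass is the
    student. `alpha` (distillation weight) and `T` (temperature) are assumed
    hyperparameters, not values taken from the paper.
    """
    with torch.no_grad():
        teacher = model(x, p_drop=0.0)     # full network, no dropping
    student = model(x, p_drop=p_drop)      # randomly pruned sub-network
    kd = F.kl_div(
        F.log_softmax(student / T, dim=-1),
        F.softmax(teacher / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return task_loss(student, targets) + alpha * kd
```

One design point this sketch highlights: reusing the full-depth pass of the same model as an in-place teacher avoids training and running a separate teacher network, which is one plausible route to the kind of training-time reduction the abstract reports.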