Performance and Complexity Trade-off Optimization of Speech Models During Training

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. This approach does not guarantee an optimal trade-off between performance and computational complexity, so post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. The limitation arises because stochastic gradient descent (SGD) methods can only optimize differentiable functions, whereas the factors that determine computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable, and changing them requires modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be optimized dynamically for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies: a synthetic example and two real-world applications, voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
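To make the idea of a differentiable stand-in for layer size concrete, the following is a minimal sketch under stated assumptions, not the formulation used in this work. It assumes each feature has a learnable logit whose sigmoid gives the share of Gaussian noise mixed into that feature, and it uses the expected number of clean features as a smooth proxy for layer width. All names (`inject_noise`, `soft_width`, `soft_width_grad`, `hard_width`, `complexity_step`) and the blending rule itself are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def inject_noise(features, logits, rng):
    """Blend each feature with unit Gaussian noise (assumed blending rule).

    sigmoid(logit) is the noise share: near 0 the feature passes through
    unchanged, near 1 it is replaced by noise and carries no information.
    """
    out = []
    for x, logit in zip(features, logits):
        share = sigmoid(logit)
        out.append((1.0 - share) * x + share * rng.gauss(0.0, 1.0))
    return out


def soft_width(logits):
    """Smooth proxy for layer size: sum of the clean shares of all features."""
    return sum(1.0 - sigmoid(logit) for logit in logits)


def soft_width_grad(logits):
    """Analytic gradient of soft_width with respect to each logit."""
    return [-sigmoid(l) * (1.0 - sigmoid(l)) for l in logits]


def hard_width(logits, threshold=0.5):
    """Integer layer size after removing features dominated by noise."""
    return sum(1 for logit in logits if sigmoid(logit) < threshold)


def complexity_step(logits, lam, lr):
    """One gradient-descent step on the penalty term lam * soft_width only.

    In a full objective this gradient would be added to the task-loss
    gradient; the task loss is omitted here to keep the sketch short.
    """
    grads = soft_width_grad(logits)
    return [l - lr * lam * g for l, g in zip(logits, grads)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [-4.0, 0.1, 5.0]
    print(inject_noise([1.0, 1.0, 1.0], logits, rng))
    print(soft_width(logits), hard_width(logits))
    print(soft_width(complexity_step(logits, lam=1.0, lr=1.0)))
```

In this sketch the integer count returned by `hard_width` cannot be optimized by SGD, while `soft_width` can, so a penalty weight such as `lam` would set the trade-off between task performance and size. Features whose noise share approaches 1 could then be removed after training.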
