The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, an integrated methodology designed to achieve synergistic model acceleration under strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight-sensitivity metric, derived from an efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. Pruning is strictly conditional: the model proceeds to 8-bit post-training quantization only if the accuracy drop remains within a maximum permissible bound (Δ_max). This coordination is critical, as it ensures the resulting sparse model structure is maximally robust to quantization error and hardware-specific kernel optimization. Evaluation across heterogeneous NVIDIA Jetson edge platforms, using resource-efficient architectures such as MobileNetV3 and ResNet-18, shows that HQP achieves up to a 3.12x inference speedup and a 55 percent model-size reduction while keeping the accuracy drop below the 1.5 percent constraint. A comparative analysis against conventional single-objective compression techniques validates HQP as a superior, hardware-agnostic solution for deploying ultra-low-latency AI on resource-limited edge infrastructure.
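To make the compression pipeline concrete, the following is a minimal sketch of the HQP loop described above: per-filter sensitivity scores from a diagonal FIM approximation (mean squared gradients) rank filters for removal, each prune is accepted only if the accuracy drop stays within Δ_max, and the surviving weights are then symmetrically quantized to 8 bits. All names (`fisher_sensitivity`, `hqp_compress`, the toy `toy_eval` evaluator and synthetic data) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_sensitivity(grads):
    # Diagonal FIM approximation: mean squared gradient per filter (row).
    return (grads ** 2).mean(axis=1)

def quantize_int8(w):
    # Symmetric 8-bit post-training quantization of the kept weights.
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def hqp_compress(weights, grads, evaluate, delta_max):
    """Prune filters in order of increasing FIM sensitivity, accepting a
    prune only while the accuracy drop stays within delta_max, then
    quantize the surviving filters to int8."""
    n = len(weights)
    base_acc = evaluate(weights, np.ones(n, dtype=bool))
    keep = np.ones(n, dtype=bool)
    for idx in np.argsort(fisher_sensitivity(grads)):  # least sensitive first
        trial = keep.copy()
        trial[idx] = False
        if base_acc - evaluate(weights, trial) <= delta_max:
            keep = trial  # accept this prune; constraint still satisfied
    q, scale = quantize_int8(weights[keep])
    return keep, q, scale

# Toy setup: 8 filters of 4 weights; gradient magnitude (and hence
# sensitivity) loosely tracks a synthetic per-filter importance.
W = rng.normal(size=(8, 4))
imp = np.linspace(0.1, 2.0, 8)
G = rng.normal(size=(8, 4)) * imp[:, None]

def toy_eval(w, mask):
    # Hypothetical accuracy proxy: fraction of total importance retained.
    return imp[mask].sum() / imp.sum()

keep, q, scale = hqp_compress(W, G, toy_eval, delta_max=0.15)
```

In a real deployment, `evaluate` would run held-out validation inference and the gradients would come from a calibration pass; the greedy accept/reject structure is what enforces the Δ_max guarantee before quantization is applied.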