PQuantML is a new open-source, hardware-aware neural network model compression library designed for end-to-end workflows. Motivated by the need to deploy performant models in environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods at different granularities, as well as fixed-point quantization with support for High-Granularity Quantization (HGQ). We evaluate PQuantML on representative tasks such as jet substructure classification (jet tagging), an edge-deployment problem that arises in real-time LHC data processing. Using various pruning methods combined with fixed-point quantization, PQuantML achieves substantial reductions in parameter count and bit-width while maintaining accuracy. We further compare the resulting compression against existing tools such as QKeras and HGQ.
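To make the two compression steps named in the abstract concrete, the following is a minimal NumPy sketch of unstructured magnitude pruning followed by signed fixed-point quantization. It is an illustration under stated assumptions, not PQuantML's API: the function names `magnitude_prune` and `fixed_point_quantize`, their parameters, and the convention that `int_bits` excludes the sign bit are all hypothetical choices made for this example, and magnitude pruning stands in for whichever pruning methods the library actually provides.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Hypothetical helper for illustration; not part of any library API.
    """
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Indices of the k smallest magnitudes (order within the k is irrelevant).
    idx = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[idx] = False
    return weights * mask.reshape(weights.shape)


def fixed_point_quantize(x, total_bits, int_bits):
    """Round `x` to a signed fixed-point grid and saturate out-of-range values.

    Assumed convention: one sign bit, `int_bits` integer bits, and
    `total_bits - 1 - int_bits` fractional bits.
    """
    frac_bits = total_bits - 1 - int_bits
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (total_bits - 1))      # most negative integer code
    q_max = 2 ** (total_bits - 1) - 1     # most positive integer code
    codes = np.clip(np.round(x * scale), q_min, q_max)
    return codes / scale


weights = np.array([[0.05, -0.8], [0.31, 1.9]])
pruned = magnitude_prune(weights, sparsity=0.5)
quantized = fixed_point_quantize(pruned, total_bits=4, int_bits=1)
print(quantized)
```

Here half of the weights are removed and the survivors are stored with 4 bits each, so 1.9 saturates at the largest representable value, 1.75. In this sketch `total_bits` and `int_bits` are scalars shared by the whole tensor; assigning a separate bit-width to each weight instead is, roughly, the idea behind high-granularity quantization.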