This article optimizes the inference performance of the Qwen-1.8B model in three ways: applying Int8 quantization, vectorizing several operators in llama.cpp, and modifying the build script to raise the compiler optimization level. On the Yitian 710 experimental platform, prefill performance improves to 1.6x and decoding performance to 24x the baseline, memory usage falls to 1/5 of the original, and the accuracy loss is almost negligible.
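To make the Int8 quantization step concrete, the following is a minimal sketch of symmetric block-wise Int8 quantization. It is an illustration under stated assumptions, not the article's implementation: the block size of 32, the per-block scale of `amax / 127`, and the function names `quantize_block`, `dequantize_block`, and `quantize` are all hypothetical choices made here, and the summary above does not state which quantization scheme was actually used.

```python
from typing import List, Tuple

# Assumed block length, chosen for illustration only.
BLOCK_SIZE = 32


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Symmetric int8 quantization of one block: returns (scale, codes)."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127.0
    if scale == 0.0:
        # An all-zero block needs no scale; every code is zero.
        return 0.0, [0] * len(values)
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate float values from one quantized block."""
    return [scale * c for c in codes]


def quantize(values: List[float]) -> List[Tuple[float, List[int]]]:
    """Split a flat weight list into blocks and quantize each one."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

In this sketch each weight is stored as a one-byte code plus one shared scale per block, and the per-weight reconstruction error is bounded by half the block's scale. That bound is the usual reason the accuracy loss from this kind of quantization tends to be small.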