Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inference. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential risks of compression in terms of safety and trustworthiness have been largely neglected. This study conducts the first thorough evaluation of three leading LLMs using five SoTA compression techniques across eight trustworthiness dimensions. Our experiments highlight the intricate interplay between compression and trustworthiness, revealing some interesting patterns. We find that quantization is currently a more effective approach than pruning for achieving efficiency and trustworthiness simultaneously. For instance, a 4-bit quantized model retains the trustworthiness of its original counterpart, whereas model pruning significantly degrades trustworthiness even at 50% sparsity. Moreover, employing quantization within a moderate bit range can unexpectedly improve certain trustworthiness dimensions such as ethics and fairness. Conversely, extreme quantization to very low bit levels (3 bits) tends to reduce trustworthiness significantly. This increased risk cannot be uncovered by looking at benign performance alone, which in turn mandates comprehensive trustworthiness evaluation in practice. These findings culminate in practical recommendations for simultaneously achieving high utility, efficiency, and trustworthiness in LLMs. Code and models are available at https://decoding-comp-trust.github.io.