Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, has raised concerns about their effective deployment and created a need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach for assessing compressed LLMs that addresses the limitations of traditional perplexity or accuracy measures, which fail to accurately reflect text generation quality. DTMs focus on token divergence, which allows deeper insights into the subtleties of model compression, in particular when evaluating the impact of individual components. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that a quarter of all attention components in the Llama-2 model family can be pruned beyond 90% while still retaining state-of-the-art (SOTA) performance. For quantization, FDTM suggests that over 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually, and show that FDTM can identify those, whereas standard metrics lead to deteriorated outcomes.
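To make the notion of token divergence concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the first divergent token is the index of the first position at which the token sequence generated by the compressed model differs from the sequence generated by the original model for the same prompt. The function name `first_divergent_token` and the example token ids are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def first_divergent_token(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Return the index of the first position where the two token sequences differ.

    Assumption (not taken from the abstract): if no position differs within the
    overlapping part, the length of the shorter sequence is returned, so two
    identical sequences of length n yield n.
    """
    for index, (ref_token, cand_token) in enumerate(zip(reference, candidate)):
        if ref_token != cand_token:
            return index
    return min(len(reference), len(candidate))


# Hypothetical token ids produced by the original and the compressed model.
original_output = [101, 7, 42, 42, 9]
compressed_output = [101, 7, 42, 13, 9]

print(first_divergent_token(original_output, compressed_output))  # → 3
```

Under this assumption, a larger value means the compressed model reproduces the original generation for longer, so the score can be compared across candidate compressions of an individual component.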