Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, raises concerns about effective deployment and creates a need for LLM compression. This study introduces Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs that addresses the limitations of traditional measures such as perplexity, which fail to accurately reflect text generation quality. DTMs focus on token divergence, providing deeper insight into the subtleties of model compression. Our results indicate that significant levels of precision and sparsity can be achieved without compromising text generation quality. Moreover, DTMs offer a more precise evaluation of each component's individual impact. Using the First Divergent Token Metric (FDTM) in model sparsification reveals that nearly 20% of all components can be pruned by more than 90%. For quantization, the FDTM suggests that over 80% of parameters can be converted directly to int8 without special outlier management.
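To make the idea of token divergence concrete, the following is a minimal sketch of how a first-divergent-token score could be computed. The abstract does not give the formal definition, so this assumes the score is the index of the first position at which the compressed model's generated token sequence departs from the base model's. The function names and the token IDs are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Sequence, Tuple


def first_divergent_token(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Return the index of the first position where the two sequences differ.

    Assumption: if one sequence is a prefix of the other (or both are
    identical), the length of the shorter sequence is returned.
    """
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    return min(len(reference), len(candidate))


def mean_first_divergent_token(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average the first-divergent-token index over (reference, candidate) pairs."""
    scores = [first_divergent_token(ref, cand) for ref, cand in pairs]
    if not scores:
        raise ValueError("at least one pair of sequences is required")
    return sum(scores) / len(scores)


# Hypothetical token IDs generated by a base model and a compressed model.
base_tokens = [12, 7, 99, 4, 31]
compressed_tokens = [12, 7, 99, 8, 31]

fdt = first_divergent_token(base_tokens, compressed_tokens)
print(fdt)  # prints 3: the sequences agree on positions 0 to 2
```

Under this assumed definition, a higher score means the compressed model follows the base model's generation for longer, which a perplexity average would not necessarily reveal.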