Scale is often credited as one of the factors behind the performance gains of LLMs, leading to models with billions and even trillions of parameters. A key limitation of such large models is their high computational requirements, which restrict their use, deployment, and debugging in resource-constrained settings. Two common ways to work around these limitations are to use smaller versions of LLMs (e.g., Llama 7B instead of Llama 70B) and to lower memory requirements through quantization. While these approaches effectively address resource constraints, their impact on model performance needs thorough examination. In this study, we perform a comprehensive evaluation of the effect of model scale and quantization on performance. We experiment with two major families of open-source instruct models ranging from 7 billion to 70 billion parameters. Our extensive zero-shot experiments across various tasks, including natural language understanding, reasoning, misinformation detection, and hallucination, reveal that larger models generally outperform their smaller counterparts, suggesting that scale remains an important factor in enhancing performance. We find that larger models show exceptional resilience to precision reduction and maintain high accuracy even at 4-bit quantization on numerous tasks, making them a better choice than smaller models at high precision under similar memory requirements.
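To make the memory trade-off concrete, the sketch below shows one way such a setup might look: loading an instruct model with 4-bit weight quantization via Hugging Face transformers and bitsandbytes, then running a zero-shot prompt. The model identifier, configuration values, and prompt are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch (assumption: 4-bit quantization via bitsandbytes through
# Hugging Face transformers; the paper's actual pipeline may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # illustrative 70B instruct model

# 4-bit weights take roughly 0.5 bytes per parameter, versus ~2 bytes per
# parameter at 16-bit precision, which is what makes a quantized large model
# comparable in memory to a smaller full-precision one.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store 4-bit weights, compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)

# Zero-shot query: no task-specific examples or fine-tuning.
prompt = "Is the following claim true or false? The Great Wall of China is visible from the Moon."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```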