Large Language Models (LLMs) with billions of parameters are known for their impressive predictive capabilities but require substantial resources to run. Given their massive rise in popularity, even a small reduction in resource requirements could have an environmental impact. Smaller models, on the other hand, require fewer resources but may sacrifice accuracy. In this work, we propose an implementation of ``stairs'' assisted greedy generation: a modified assisted-generation method that combines a small model's fast generation, a large model's batched prediction, and ``stairs'' validation to speed up prediction generation. Results show a 9.58 to 17.24 percent reduction in inference time compared to stand-alone large-LLM prediction on a text generation task, with no loss in accuracy.
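The core draft-then-verify loop underlying assisted greedy generation can be sketched as follows. This is a minimal toy illustration, not the paper's actual implementation: `small_next` and `large_next` are hypothetical stand-ins for the draft and target models (real systems would run batched LLM forward passes), and the specific ``stairs'' validation scheme is not reproduced here.

```python
def small_next(seq):
    # Hypothetical cheap draft model: usually agrees with the large
    # model, but diverges whenever the next value is a multiple of 5,
    # simulating an imperfect assistant.
    n = seq[-1] + 1
    return n + 1 if n % 5 == 0 else n

def large_next(seq):
    # Hypothetical expensive target model: defines the "correct"
    # greedy continuation (here, simply counting up).
    return seq[-1] + 1

def assisted_greedy(prompt, new_tokens, draft_len=4):
    """Generate new_tokens tokens by drafting with the small model
    and verifying drafts against the large model's greedy choices."""
    seq = list(prompt)
    target = len(prompt) + new_tokens
    while len(seq) < target:
        # 1) Draft: the small model proposes draft_len tokens
        #    autoregressively (fast, sequential).
        draft = []
        for _ in range(draft_len):
            draft.append(small_next(seq + draft))
        # 2) Verify: the large model checks all draft positions; in a
        #    real LLM this is one batched forward pass, emulated here
        #    position by position. Accept the longest matching prefix,
        #    then substitute the large model's token at the first miss.
        accepted = []
        for tok in draft:
            correct = large_next(seq + accepted)
            if tok == correct:
                accepted.append(tok)
            else:
                accepted.append(correct)
                break
        seq += accepted
    return seq[:target]

# Output matches plain greedy decoding with the large model alone,
# while the large model is queried in verification batches.
print(assisted_greedy([0], 10))
```

Because every accepted token is either confirmed or produced by the large model, the output is identical to stand-alone large-model greedy decoding; the savings come from validating several drafted tokens per large-model batch rather than generating one token per call.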