The inference cost of Large Language Models (LLMs) poses a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We observe that, when prompted appropriately, LLMs can generate distilled, concise outputs that retain the essential meaning. We propose a framework for reducing computational cost in which a shorter distilled output from the LLM is reconstructed into a full narrative by a smaller model with lower inference cost. Our experiments show promising results, particularly in general-knowledge domains, saving 20.58% of tokens on average with only a minor decrease in evaluation metrics, suggesting that this approach can effectively balance efficiency and accuracy in language processing tasks.