The inference cost of Large Language Models (LLMs) is a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We observe that, when prompted appropriately, LLMs can generate concise, distilled outputs that retain the essential meaning. We propose TRIM, a pipeline that reduces computational cost by having the LLM produce a shorter distilled output, which a smaller model with lower inference cost then reconstructs into a full narrative. Our experiments show promising results, particularly in general-knowledge domains, saving 20.58% of tokens on average with only a minor decrease in evaluation metrics, suggesting that this approach can effectively balance efficiency and accuracy in language processing tasks.
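The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate_distilled`, `reconstruct`) and the stubbed model outputs are hypothetical stand-ins, since the abstract does not specify an API, and token counts are approximated by whitespace splitting.

```python
def generate_distilled(prompt: str) -> str:
    """Stand-in for the large LLM, prompted to emit a concise distilled answer."""
    # A real system would call the expensive model here.
    return "Paris: capital of France; pop. ~2.1M (city)."

def reconstruct(distilled: str) -> str:
    """Stand-in for the smaller, cheaper model that expands the distilled text."""
    # A real system would call the small model here.
    return ("Paris is the capital of France, with a city population "
            "of roughly 2.1 million.")

def trim_pipeline(prompt: str) -> tuple[str, float]:
    distilled = generate_distilled(prompt)  # expensive model, few output tokens
    full = reconstruct(distilled)           # cheap model produces the long text
    # Fraction of expensive-model tokens saved, approximated by word count.
    saved = 1 - len(distilled.split()) / len(full.split())
    return full, saved

answer, saved = trim_pipeline("Tell me about Paris.")
```

The cost saving comes from shifting most of the generated tokens from the large model to the small one; the `saved` ratio here is only a whitespace-token proxy for the 20.58% figure reported in the experiments.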