The inference cost of Large Language Models (LLMs) poses a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We observe that, when prompted appropriately, LLMs can generate concise, distilled outputs that retain the essential meaning of a full response. We propose TRIM, a pipeline for reducing computational cost in which a shorter distilled output from the LLM is reconstructed into a full narrative by a smaller model with lower inference cost. Our experiments show promising results, particularly in general knowledge domains, with 20.58% of tokens saved on average and only a slight decrease in evaluation metrics, suggesting that this approach can effectively balance efficiency and accuracy in language processing tasks.