Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate a retrieval-augmented pipeline that supplies them with rich external knowledge and context. Nevertheless, challenges arise when the retriever returns inaccurate or coarse-grained context. Supplying irrelevant context to LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby making the use of LLMs both faster and more effective. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, we achieved a 50% reduction in context-related costs, yielding a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
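To make the core idea concrete, the following is a minimal illustrative sketch of instruction-aware context filtering. It is not the paper's actual implementation: the lexical-overlap scoring function, the `compress_context` helper, and the fixed 50% keep ratio are all assumptions chosen for illustration; the real method would use a learned, instruction-conditioned relevance model.

```python
# Hypothetical sketch (NOT the paper's implementation): rank retrieved
# context chunks by lexical overlap with the instruction and keep only
# the top fraction, mimicking the 50% context reduction reported above.

def compress_context(instruction: str, chunks: list[str],
                     keep_ratio: float = 0.5) -> list[str]:
    instr_tokens = set(instruction.lower().split())

    def score(chunk: str) -> float:
        # Fraction of the chunk's tokens that also appear in the instruction.
        tokens = set(chunk.lower().split())
        return len(tokens & instr_tokens) / (len(tokens) or 1)

    # Keep the highest-scoring chunks, but at least one.
    ranked = sorted(chunks, key=score, reverse=True)
    kept = set(ranked[: max(1, int(len(chunks) * keep_ratio))])

    # Return survivors in their original order to preserve discourse flow.
    return [c for c in chunks if c in kept]
```

A learned compressor would replace the `score` function with a model that estimates each chunk's usefulness for answering the given instruction, but the keep-top-fraction structure is the same.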