Since ChatGPT released its API for public use, the number of applications built on top of commercial large language models (LLMs) has increased exponentially. One popular use of such models is to exploit their in-context learning ability and generate responses to user queries from knowledge obtained through retrieval augmentation. One problem with deploying commercial retrieval-augmented LLMs is cost: the additional retrieved context greatly increases the input token size of the LLM. To mitigate this, we propose a token compression scheme that includes two methods: summarization compression and semantic compression. The first method applies a T5-based model, fine-tuned on datasets generated with self-instruct that contain samples of varying lengths, and reduces the token size by summarization. The second method further compresses the token size by removing words that have a lower impact on the semantics. To adequately evaluate the effectiveness of the proposed methods, we propose and use a dataset called Food-Recommendation DB (FRDB), which focuses on food recommendation for women around the pregnancy period and for infants. Our summarization compression reduces the retrieval token size by 65% while improving accuracy by a further 0.3%; semantic compression provides a more flexible way to trade off token size against performance, with which we can reduce the token size by 20% at the cost of only a 1.6% drop in accuracy.
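The idea behind semantic compression, removing the words that contribute least to the meaning, can be illustrated with a minimal sketch. The abstract does not say how semantic impact is measured, so the scoring function `toy_impact` below is a hypothetical stand-in (function words score zero, other words score by length), and the names `semantic_compress` and `drop_ratio` are assumptions made for illustration. The sketch also counts whitespace-separated words rather than model tokens.

```python
from typing import Callable

# Hypothetical stand-in for a semantic-impact score. The source does not
# specify its metric; here common function words score 0 and every other
# word scores by its length.
_FUNCTION_WORDS = {
    "a", "an", "the", "of", "to", "in", "is", "are",
    "and", "or", "for", "with", "that",
}


def toy_impact(word: str) -> float:
    """Return an illustrative importance score for a single word."""
    w = word.lower().strip(".,;:!?")
    return 0.0 if w in _FUNCTION_WORDS else float(len(w))


def semantic_compress(
    text: str,
    drop_ratio: float,
    impact: Callable[[str], float] = toy_impact,
) -> str:
    """Drop the lowest-impact fraction of words, keeping the original order."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    words = text.split()
    n_drop = int(len(words) * drop_ratio)
    # Rank word positions by ascending impact; sorted() is stable, so ties
    # are broken by position in the text.
    order = sorted(range(len(words)), key=lambda i: impact(words[i]))
    dropped = set(order[:n_drop])
    return " ".join(w for i, w in enumerate(words) if i not in dropped)


if __name__ == "__main__":
    context = "Spinach is a good source of iron for pregnant women"
    print(semantic_compress(context, drop_ratio=0.4))
```

Raising `drop_ratio` removes more words and shrinks the prompt further, which mirrors the trade-off between token size and accuracy described above. A real system would replace `toy_impact` with a model-based measure of semantic importance.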