Since ChatGPT released its API for public use, the number of applications built on top of commercial large language models (LLMs) has increased exponentially. One popular use of such models is to leverage their in-context learning ability and generate responses to user queries using knowledge obtained through retrieval augmentation. One problem with deploying commercial retrieval-augmented LLMs is cost: the additionally retrieved context greatly increases the number of input tokens sent to the LLM. To mitigate this, we propose a token compression scheme that includes two methods: summarization compression and semantic compression. The first method applies a T5-based model, fine-tuned on datasets generated using self-instruct that contain samples of varying lengths, and reduces token size through summarization. The second method further compresses the token size by removing words with lower impact on the semantics. To adequately evaluate the effectiveness of the proposed methods, we propose and use a dataset called Food-Recommendation DB (FRDB), which focuses on food recommendation for women around the pregnancy period or for infants. Our summarization compression can reduce the retrieval token size by 65% with a further 0.3% improvement in accuracy; semantic compression provides a more flexible way to trade off token size against performance, with which we can reduce the token size by 20% with only a 1.6% drop in accuracy.
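To make the idea of semantic compression concrete, the following is a minimal sketch, not the method used in this work. The abstract does not specify how a word's impact on the semantics is measured, so the sketch assumes a hypothetical score: words that are frequent in a background corpus are treated as low-impact and removed first. The names `semantic_compress`, `normalize`, `keep_ratio`, and `background_counts` are illustrative, and whitespace-separated words stand in for LLM tokens.

```python
from collections import Counter


def normalize(word):
    """Lowercase a word and strip surrounding punctuation for lookup."""
    return word.lower().strip(".,;:!?\"'()")


def semantic_compress(text, keep_ratio, background_counts):
    """Keep roughly `keep_ratio` of the words, dropping low-impact ones first.

    Assumption (not from the paper): a word's impact is approximated by its
    rarity in `background_counts`; the most frequent words are removed first.
    """
    words = text.split()
    if not words:
        return text
    n_keep = max(1, round(len(words) * keep_ratio))
    # Rank word positions from rarest to most frequent. The sort is stable,
    # so earlier words win ties.
    ranked = sorted(
        range(len(words)),
        key=lambda i: background_counts[normalize(words[i])],
    )
    # Restore the original word order for the positions that are kept.
    kept = sorted(ranked[:n_keep])
    return " ".join(words[i] for i in kept)


# Toy background frequencies (made up for illustration only).
background = Counter(
    {"the": 100, "a": 90, "is": 80, "and": 75, "of": 70, "for": 60}
)
retrieved = "Spinach is a good source of folate for the mother"
compressed = semantic_compress(retrieved, 0.8, background)
print(compressed)
```

Lowering `keep_ratio` removes more words, which is the knob that trades token size against answer quality; in this toy example a ratio of 0.8 drops the two most frequent function words.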