We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows its datastore to be effectively "compacted". REST is a drafting technique for speculative decoding that retrieves, from a datastore, exact matches for the n-gram formed by the n most recent tokens generated by the target LLM. The key idea of CREST is to store only a subset of the smallest and most common n-grams in the datastore, with the aim of achieving comparable performance in less storage space. We find that storing this subset both reduces storage space and improves performance. On the HumanEval and MT Bench benchmarks, CREST matches REST's accepted token length with 10.6-13.5x less storage space and, given the same storage space, achieves a 16.5-17.1% higher accepted token length than REST.
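To make the retrieval idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a plain dictionary as the datastore, mapping each context n-gram to the continuations observed after it, and it approximates "smallest and most common" by capping the context length and keeping only the most frequent contexts. The names `build_datastore`, `draft`, `max_n`, `draft_len`, and `keep_top` are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def build_datastore(corpus, max_n=2, draft_len=2, keep_top=None):
    """Map each context n-gram (length 1..max_n) to a Counter of continuations.

    If keep_top is given, keep only that many contexts, preferring the most
    frequent ones and, among ties, the shortest (an assumed compaction rule).
    """
    store = defaultdict(Counter)
    for tokens in corpus:
        for n in range(1, max_n + 1):
            # Every position that has at least one token following the context.
            for i in range(len(tokens) - n):
                context = tuple(tokens[i:i + n])
                continuation = tuple(tokens[i + n:i + n + draft_len])
                store[context][continuation] += 1
    if keep_top is None:
        return dict(store)
    # Rank contexts: higher total count first, then shorter, then lexicographic.
    ranked = sorted(store, key=lambda c: (-sum(store[c].values()), len(c), c))
    return {c: store[c] for c in ranked[:keep_top]}


def draft(store, generated, max_n=2):
    """Return draft tokens for the longest recent context found in the store."""
    for n in range(min(max_n, len(generated)), 0, -1):
        context = tuple(generated[-n:])
        if context in store:
            # Exact match: propose the most frequently seen continuation.
            continuation, _ = store[context].most_common(1)[0]
            return list(continuation)
    return []  # No match: the target model decodes without a draft.


corpus = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["a", "b", "c", "d"]]
full = build_datastore(corpus)
compact = build_datastore(corpus, keep_top=2)
```

In this toy setup, `draft(full, ["x", "a", "b"])` matches the two-token context `("a", "b")`, while `draft(compact, ["x", "a", "b"])` falls back to the one-token context `("b",)` because the compacted store retains only two contexts. In a real speculative decoding loop, the target LLM would then verify the drafted tokens and accept the longest correct prefix.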