Retrieval augmentation addresses many critical problems in large language models, such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale because they must process large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Although binarization can reduce accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, BTR requires only 127GB of disk space to encode 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance.
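To make the storage argument concrete, the following is a minimal sketch of how a real-valued token vector could be reduced to a 1-bit vector and packed into bytes. It is an illustration only, not the BTR method itself: sign-based binarization, the 768-dimensional hidden size, and the helper names `binarize`, `unpack`, and `bytes_per_token` are all assumptions introduced here, and the calibration techniques, training objectives, and compression steps mentioned in the abstract are not modeled. The last helper simply divides the reported 127GB by the reported 3 billion tokens, taking 1GB as 10^9 bytes.

```python
import numpy as np


def binarize(hidden: np.ndarray) -> np.ndarray:
    """Pack token vectors of shape (n_tokens, dim) into 1-bit codes.

    Assumption: a simple sign threshold stands in for whatever
    binarization the actual method uses.
    """
    bits = (hidden > 0).astype(np.uint8)  # 1 where the value is positive, else 0
    return np.packbits(bits, axis=-1)     # 8 bits per stored byte


def unpack(codes: np.ndarray, dim: int) -> np.ndarray:
    """Expand packed codes back to {-1, +1} float vectors of width dim."""
    bits = np.unpackbits(codes, axis=-1, count=dim)
    return bits.astype(np.float32) * 2.0 - 1.0


def bytes_per_token(total_bytes: float, n_tokens: float) -> float:
    """Average storage per token for a given corpus size."""
    return total_bytes / n_tokens


# Hypothetical example: 4 tokens with a 768-dimensional hidden size.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 768)).astype(np.float32)

codes = binarize(hidden)
restored = unpack(codes, dim=768)

# float32 uses 32 bits per dimension, the packed code uses 1 bit.
compression_ratio = hidden.nbytes // codes.nbytes

# Figures from the abstract: 127GB for 3 billion tokens.
avg_bytes = bytes_per_token(127e9, 3e9)
```

Under these assumptions, packing alone gives a 32x reduction relative to float32 vectors, and the reported figures work out to roughly 42 bytes per token on average. The gap between those two numbers and the reported 100x storage reduction is presumably covered by the offline and runtime compression the abstract refers to, which this sketch does not attempt to reproduce.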