Retrieval augmentation addresses several critical problems in large language models, such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale, because they must process large amounts of retrieved text at inference time. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in the retrieved passages, significantly reducing computation during inference. Although binarization can degrade accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, BTR requires only 127GB of disk space to encode the 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% of task performance.
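For intuition, here is a minimal sketch (not the paper's implementation) of how 1-bit token codes might be produced and stored: float token vectors are binarized with a sign threshold and bit-packed. The hidden size d = 768 and the use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np

def binarize_tokens(hidden_states: np.ndarray) -> np.ndarray:
    """Map float token vectors to 1-bit codes via a sign threshold,
    then pack 8 bits per byte for compact storage.
    hidden_states: (num_tokens, d) float array; d assumed divisible by 8."""
    bits = (hidden_states > 0).astype(np.uint8)   # 1 bit per dimension
    return np.packbits(bits, axis=1)              # shape: (num_tokens, d // 8)

# Toy usage with a hypothetical hidden size d = 768.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 768)).astype(np.float32)
codes = binarize_tokens(h)
print(codes.shape)  # (4, 96): 96 bytes per token vs. 3072 bytes in fp32
```

Under these assumptions, the raw codes for 3 billion tokens would occupy roughly 3e9 × 96 B ≈ 288GB, which is consistent with the abstract's claim that additional offline and runtime compression is needed to reach the reported 127GB.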