The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advances in diverse fields. Yet they still struggle to capture the long-distance dependencies within sequences that deep semantic understanding requires. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences in a manner akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM accurately captures pertinent information within a fixed window size and provides precise answers to queries. It requires no extra training and can be seamlessly integrated with any LLM. Q-LLM built on LLaMA3 (QuickLLaMA) can read Harry Potter within 30 s and accurately answer questions about it. On the $\infty$-bench, Q-LLM improves on the current state of the art by 7.17% with LLaMA3 and by 3.26% with Mistral. On the widely recognized Needle-in-a-Haystack task, Q-LLM improves on the current SOTA by 7.0% with Mistral and achieves 100% with LLaMA3. Our code is available at https://github.com/dvlab-research/Q-LLM.
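The core idea of focusing on memory relevant to a given query within a fixed window can be illustrated with a minimal sketch. This is not the paper's actual mechanism, which operates on attention memory inside the model; here we stand in a toy token-overlap relevance score and plain text chunks, and all function names and parameters are illustrative assumptions.

```python
# Hedged sketch of query-aware context selection: rank fixed-size chunks of
# a long text by a toy relevance score against the query, then keep only the
# top chunks that fit a fixed window budget. The real Q-LLM system works on
# model memory, not raw text; this merely illustrates the selection idea.

def chunk_text(text, chunk_size=32):
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def relevance(chunk, query):
    """Toy relevance: fraction of query words that appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    query_words = set(query.lower().split())
    return len(chunk_words & query_words) / max(len(query_words), 1)

def select_context(text, query, window_chunks=2):
    """Keep the chunks most relevant to the query, within a fixed window,
    preserving their original document order."""
    chunks = chunk_text(text)
    ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
    selected = set(ranked[:window_chunks])
    return [c for c in chunks if c in selected]
```

Under this sketch, a query about a needle buried in long filler text would pull in only the chunk containing the needle, so the model's fixed context window is spent on pertinent material rather than on the entire sequence.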