Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
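The two phases described above can be illustrated with a toy, single-process sketch. This is not the paper's implementation: the function name and shapes are illustrative, and the method's distributed execution and per-host softmax aggregation are omitted; it only shows the sparsity pattern — context blocks attend locally in phase 1, while query tokens attend globally to all cached context tokens in phase 2.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def star_attention_sketch(ctx_q, ctx_k, ctx_v, qry_q, block_size):
    """Toy two-phase sketch (names and shapes are illustrative):
    phase 1 applies blockwise-local attention over the context;
    phase 2 lets query tokens attend to all cached context tokens."""
    n, d = ctx_k.shape
    scale = 1.0 / np.sqrt(d)
    # Phase 1: each context block attends only within its own block
    # (in the real system, blocks live on different hosts and run in parallel).
    ctx_out = np.zeros_like(ctx_v)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        scores = softmax(ctx_q[start:end] @ ctx_k[start:end].T * scale)
        ctx_out[start:end] = scores @ ctx_v[start:end]
    # Phase 2: query tokens attend globally to every cached context token.
    qry_scores = softmax(qry_q @ ctx_k.T * scale)
    return ctx_out, qry_scores @ ctx_v
```

Because phase 1 never forms the full n-by-n score matrix, its cost grows linearly in the number of blocks rather than quadratically in sequence length.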