Long-context LLM inference is bottlenecked not by compute but by the O(n) memory-bandwidth cost of scanning the KV cache at every decode step, a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, they inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm: the query fans out to all candidates via passive splitting, the block signatures are quasi-static (matching electro-optic programming of microring resonators, MRRs), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as n increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with a 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
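The coarse block-selection step described above can be illustrated with a minimal software sketch. It is not the PRISM hardware model: it only shows the ranking logic (low-precision inner products between one query and per-block signatures, followed by top-k selection) and the resulting traffic arithmetic. Several details are assumptions made for illustration and are not stated in the abstract: the function names, mean-pooled keys as block signatures, uniform 5-bit quantization, the reading of k as the number of selected blocks, and a block size of 128 tokens (chosen only because it makes 32 of 512 blocks a 16x reduction at 64K).

```python
# Illustrative sketch only; all names and parameters below are hypothetical.

def quantize(vec, scale, bits=5):
    """Uniformly quantize `vec` to signed integers of roughly `bits` bits."""
    levels = 2 ** (bits - 1) - 1          # e.g. 15 levels per sign for 5 bits
    return [round(x / scale * levels) for x in vec]


def block_signatures(keys, block_size):
    """Assumed signature: the mean key vector of each block of tokens."""
    sigs = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        sigs.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return sigs


def select_blocks(query, signatures, k, bits=5):
    """Return the indices of the k blocks whose quantized signatures score
    highest against the quantized query. Only rank order is used."""
    q_scale = max(abs(x) for x in query) or 1.0
    s_scale = max(abs(x) for s in signatures for x in s) or 1.0
    q = quantize(query, q_scale, bits)
    scores = [
        sum(a * b for a, b in zip(q, quantize(s, s_scale, bits)))
        for s in signatures
    ]
    ranked = sorted(range(len(signatures)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def traffic_reduction(context_len, block_size, k):
    """Ratio of blocks scanned by a full pass to blocks fetched after selection."""
    n_blocks = context_len // block_size
    return n_blocks / k
```

In this sketch the electronic cost of `select_blocks` still grows with the number of blocks; the abstract's argument is that the photonic engine evaluates all of these inner products in a single broadcast, so only the k selected blocks need to be fetched from memory.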