Optical Transformers

The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width, an asymptotic advantage over digital systems. We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a $100 \times$ energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a $>8,000\times$ energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC. We analyzed how these results motivate and inform the construction of future optical accelerators along with optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5$\times$ cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimated that optical computers' advantage against current 300-fJ/MAC digital processors could grow to $>100,000\times$.
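The two quantitative claims above can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' simulation: the function names and the proportionality constant `c_fj` are hypothetical, and only the 300 fJ/MAC baseline, the $\frac{1}{d}$ scaling, and the advantage factors come from the abstract. It shows that under $\frac{1}{d}$ scaling, doubling the width halves the optical energy per MAC while the optical energy per length-$d$ dot product stays constant, and it converts each stated advantage factor into the effective energy per MAC that it implies against the digital baseline.

```python
# Digital baseline quoted in the abstract: 300 fJ per multiply-accumulate.
DIGITAL_FJ_PER_MAC = 300.0


def optical_energy_per_mac(d, c_fj=1.0):
    """Optical energy per MAC (fJ) under the 1/d scaling from the abstract.

    c_fj is a hypothetical proportionality constant; the abstract gives
    only the scaling with width d, not the constant's value.
    """
    return c_fj / d


def implied_energy_per_mac(advantage, baseline_fj=DIGITAL_FJ_PER_MAC):
    """Effective energy per MAC (fJ) implied by an efficiency advantage
    over the digital baseline. For an advantage stated as "> x", the
    returned value is an upper bound."""
    return baseline_fj / advantage


if __name__ == "__main__":
    # 1/d scaling: per-MAC energy falls with width, while the energy of a
    # length-d dot product (d MACs) stays at c_fj.
    for d in (1024, 2048, 4096):
        per_mac = optical_energy_per_mac(d)
        print(f"d={d}: {per_mac:.6f} fJ/MAC, {d * per_mac:.3f} fJ per dot product")

    # Advantage factors quoted in the abstract.
    for advantage in (100, 8_000, 100_000):
        print(f"{advantage}x advantage: {implied_energy_per_mac(advantage):.4f} fJ/MAC")
```

Read this way, the quoted advantages are statements about effective energy per MAC for the whole system, whereas the $\frac{1}{d}$ law concerns the optical energy alone.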
