Tensors play a vital role in machine learning (ML) and often exhibit properties best explored in their native high-order form. Efficient ML computation requires exploiting sparsity, but general-purpose hardware support for sparse operations remains challenging. This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction that achieves over 25x speedup on a deep learning workload. Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines (SDPEs). The memory structure and job distribution scheme are customizable, and we demonstrate a simple configuration as a proof of concept. We address the control-flow challenges of navigating sparse data structures, representing high-order tensors, and handling high sparsity. We demonstrate the effectiveness of our approach through a range of evaluations, showing that speedup grows with both sparsity and tensor order.
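To make the core unit of work concrete: each SDPE evaluates a sparse dot product, multiplying only the coordinates where both operands are nonzero. The sketch below is illustrative only, not the paper's hardware design; the coordinate-list format and the `sparse_dot` helper are assumptions for exposition, showing the merge-join over sorted (index, value) pairs that such an engine performs in effect.

```python
# Hypothetical software analogue of an SDPE's job (not FLAASH's implementation):
# a dot product of two sparse vectors stored as sorted (index, value) pairs,
# skipping all positions where either operand is zero.

def sparse_dot(a, b):
    """Merge-join two sorted coordinate lists; multiply only matching indices."""
    i = j = 0
    acc = 0.0
    while i < len(a) and j < len(b):
        ia, va = a[i]
        ib, vb = b[j]
        if ia == ib:          # both operands nonzero at this index
            acc += va * vb
            i += 1
            j += 1
        elif ia < ib:         # advance whichever list is behind
            i += 1
        else:
            j += 1
    return acc

# Only index 3 appears in both vectors, so the result is 2.0 * 5.0 = 10.0.
x = [(0, 1.0), (3, 2.0), (7, 4.0)]
y = [(1, 3.0), (3, 5.0)]
print(sparse_dot(x, y))  # 10.0
```

A high-order tensor contraction decomposes into many such dot products along the contracted modes, which is what makes distributing them across independent engines natural.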