As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful ways to improve efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch into fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. Beyond fusion across kernels (expressions), FuseFlow also supports optimizations such as parallelization, dataflow ordering, and sparsity blocking, and it targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models; the best fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.
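To make "cross-expression fusion" concrete, the sketch below (a hypothetical illustration, not FuseFlow's actual IR or generated code; all names are invented) shows two sparse expressions, an SpMV over a CSR matrix followed by an elementwise ReLU, evaluated either as separate kernels with a materialized intermediate or fused into a single traversal of the nonzeros:

```python
# Hypothetical sketch of cross-expression fusion over sparse kernels.
# Unfused: y = A @ x (SpMV over CSR), then z = relu(y), with y materialized.
# Fused: both expressions computed in one pass over A's nonzeros, so the
# intermediate vector y is never written out and re-read.

def spmv_csr(indptr, indices, data, x):
    """Unfused kernel 1: sparse matrix-vector product over a CSR matrix."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):          # for each row i
        for k in range(indptr[i], indptr[i + 1]):  # nonzeros of row i
            y[i] += data[k] * x[indices[k]]
    return y

def relu(y):
    """Unfused kernel 2: elementwise max(0, .) over the dense intermediate."""
    return [v if v > 0 else 0.0 for v in y]

def fused_spmv_relu(indptr, indices, data, x):
    """Cross-expression fusion: both expressions in one traversal of A."""
    z = [0.0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):
        acc = 0.0                              # per-row accumulator kept local
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        z[i] = acc if acc > 0 else 0.0         # ReLU applied before storing
    return z

# CSR encoding of A = [[1, 0, -2], [0, 3, 0]]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, -2.0, 3.0]
x = [1.0, 1.0, 1.0]
assert relu(spmv_csr(indptr, indices, data, x)) == fused_spmv_relu(indptr, indices, data, x)
```

On a dataflow architecture the fused form maps both expressions onto one pipeline, which is the kind of transformation whose profitability the design-space exploration above evaluates.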