Vision Transformers (ViTs) have established new performance benchmarks in vision tasks such as image recognition and object detection. However, these advancements come with significant demands for memory and computational resources, presenting challenges for hardware deployment. Heterogeneous compute-in-memory (CIM) accelerators have emerged as a promising solution for enabling energy-efficient deployment of ViTs. Despite this potential, monolithic CIM-based designs face scalability issues due to the size limitations of a single chip. To address this challenge, emerging chiplet-based techniques offer a more scalable alternative. However, chiplet designs come with their own costs, as they introduce expensive communication, which can hinder improvements in throughput. This work introduces Hemlet, a heterogeneous CIM chiplet system designed to accelerate ViT workloads. Hemlet enables flexible resource scaling through the integration of heterogeneous analog CIM (ACIM), digital CIM (DCIM), and Intermediate Data Process (IDP) chiplets. To improve throughput while reducing communication overhead, it employs a group-level parallelism (GLP) mapping strategy and system-level dataflow optimization, achieving speedups ranging from 2.41x to 5.74x across various hardware configurations within the chiplet system. Our evaluation results show that Hemlet can reach a throughput of 9.56 TOPS with an energy efficiency of 4.98 TOPS/W.