AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of real-world applications, from autonomous vehicles to medical imaging. Many of these applications require efficient object detection algorithms and complementary real-time, low-latency hardware to run inference on them. The YOLO family of models is considered the most efficient for object detection, as it requires only a single pass through the model. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low-latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow generates accelerator designs that demonstrate performance and energy characteristics competitive with GPU devices, and that outperform current state-of-the-art FPGA accelerators.