With the increasing adoption of AI applications such as large language models (LLMs) and computer vision, the computational demands on AI inference systems are rising steadily, making better utilization of existing hardware to raise task processing capacity a primary objective in edge clouds. We propose EPARA, an end-to-end parallel AI inference framework for edge clouds, designed to enhance edge AI serving capability. Our key idea is to categorize tasks by their sensitivity to latency/frequency and their GPU resource requirements, enabling task-resource allocation at both the request level and the service level. EPARA consists of three core components: 1) a task-categorized parallelism allocator that decides the parallel mode of each task, 2) a distributed request handler that performs the computation for each request, and 3) a state-aware scheduler that periodically updates service placement in edge clouds. We implement an EPARA prototype and conduct a case study of its operation on LLM and segmentation tasks. Evaluation on a testbed of edge servers, embedded devices, and microcomputers shows that EPARA achieves up to 2.1$\times$ higher goodput on production workloads than prior frameworks, while adapting to diverse edge AI inference tasks.
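To make the categorization idea concrete, the following is a minimal sketch of what a task-categorized parallelism allocator could look like. It is illustrative only: the names (`ParallelMode`, `TaskProfile`, `choose_parallel_mode`), the set of parallel modes, and the decision rules are all assumptions, not EPARA's actual policy.

```python
from dataclasses import dataclass
from enum import Enum

class ParallelMode(Enum):
    DATA = "data"          # replicate the service; split requests across replicas
    PIPELINE = "pipeline"  # split model stages across devices; overlap batches
    TENSOR = "tensor"      # shard individual layers across devices

@dataclass
class TaskProfile:
    latency_sensitive: bool  # e.g., interactive LLM chat
    high_frequency: bool     # e.g., per-frame segmentation
    gpu_demand_gb: float     # estimated GPU memory footprint of the model

def choose_parallel_mode(task: TaskProfile, device_mem_gb: float) -> ParallelMode:
    """Hypothetical allocator: map a task's profile to a parallel mode."""
    # If the model fits on a single device, data parallelism avoids
    # cross-device communication and scales with request frequency.
    if task.gpu_demand_gb <= device_mem_gb:
        return ParallelMode.DATA
    # Oversized, latency-sensitive tasks: tensor parallelism reduces
    # per-request latency at the cost of more communication.
    if task.latency_sensitive:
        return ParallelMode.TENSOR
    # Oversized, throughput-oriented tasks: pipeline parallelism keeps
    # all devices busy across batched requests.
    return ParallelMode.PIPELINE
```

The sketch captures only the request-level decision; a real allocator would also account for service-level placement and current device load, as the scheduler described above does.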