While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in tens to hundreds of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delays under high sandbox churn, which is typical in FaaS clusters. Generic cluster managers use hierarchical abstractions and multiple internal components to manage and reconcile state with frequent persistent updates. This design becomes a bottleneck for FaaS, where cluster state changes frequently because sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration built on three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandboxes from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda, and that it can spin up 2500 sandboxes per second at low latency, which is 1250x more than Knative.
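The second principle in the abstract, keeping persistent state updates off the critical path, can be illustrated with a toy model. The sketch below is not Dirigent's implementation: the class `ControlPlane`, its methods, the JSON file store, and the idea of rebuilding sandbox state from worker reports are all assumptions made for illustration. It only shows the general pattern of persisting rare, user-visible state (function registrations) while keeping high-churn sandbox state in memory and accepting an approximate reconstruction after a restart.

```python
import json
import os
import tempfile


class ControlPlane:
    """Toy control plane (hypothetical): durable registrations, in-memory sandboxes."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.functions = {}      # persisted: function name -> image
        self.sandboxes = {}      # NOT persisted: sandbox id -> function name
        self.persist_writes = 0  # counts writes to the durable store
        if os.path.exists(store_path):
            with open(store_path) as f:
                self.functions = json.load(f)

    def register_function(self, name, image):
        # Rare, user-visible change: write it to the durable store.
        self.functions[name] = image
        with open(self.store_path, "w") as f:
            json.dump(self.functions, f)
        self.persist_writes += 1

    def create_sandbox(self, sandbox_id, function_name):
        # Critical path of an invocation: update memory only, no durable write.
        if function_name not in self.functions:
            raise KeyError(function_name)
        self.sandboxes[sandbox_id] = function_name

    def recover_sandboxes(self, worker_reports):
        # Assumed recovery path: after a restart, rebuild sandbox state from
        # whatever workers report, rather than from an exact persisted log.
        for report in worker_reports:
            for sandbox_id, function_name in report.items():
                if function_name in self.functions:
                    self.sandboxes[sandbox_id] = function_name


def demo():
    path = os.path.join(tempfile.mkdtemp(), "functions.json")
    cp = ControlPlane(path)
    cp.register_function("resize", "example/resize:1")
    for i in range(1000):
        cp.create_sandbox(f"sb-{i}", "resize")
    writes = cp.persist_writes

    # Simulate a control plane restart: registrations survive, sandboxes do not.
    restarted = ControlPlane(path)
    restarted.recover_sandboxes([{"sb-0": "resize", "sb-1": "resize"}])
    return writes, restarted.functions, len(restarted.sandboxes)
```

In this sketch, creating 1000 sandboxes triggers no durable writes beyond the single function registration. After the simulated restart, only the sandboxes that workers still report are known to the control plane, which is the kind of relaxed reconstruction guarantee the abstract refers to.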