Serverless computing relieves developers of the burden of resource management, offering ease of use to users and giving providers the opportunity to optimize resource utilization. However, today's serverless systems lack performance guarantees for function invocations, which limits their support for performance-critical applications: we observed severe performance variability (up to 6x). Providers lack visibility into user functions and therefore find it challenging to right-size them: we observed heavy resource underutilization (up to 80%). To understand the causes of this performance variability and underutilization, we conducted a measurement study of commonly deployed serverless functions and learned that function performance and resource utilization depend crucially on function semantics and inputs. Our key insight is to delay resource allocation decisions until the function inputs are available. We introduce Shabari, a resource management framework for serverless systems that makes decisions as late as possible to right-size each invocation, meeting functions' performance objectives (SLOs) and improving resource utilization. Shabari uses an online learning agent to right-size each function invocation based on features of the function input, and it makes cold-start-aware scheduling decisions. For a range of serverless functions and inputs, Shabari reduces SLO violations by 11-73% while wasting no vCPUs and reducing wasted memory by 64-94% in the median case, compared to state-of-the-art systems including Aquatope, Parrotfish, and Cypress.
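To make the idea of input-aware, per-invocation right-sizing concrete, the following is a minimal illustrative sketch and not Shabari's actual agent. Everything in it is an assumption introduced for illustration: the class name `OnlineRightSizer`, its methods, the single-number "work" model (vCPU-seconds as a linear function of input features), the assumption that runtime scales inversely with the number of vCPUs, and the normalized least-mean-squares update rule.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class OnlineRightSizer:
    """Hypothetical per-function agent (illustrative only).

    Learns online how much work (vCPU-seconds) an invocation needs as a
    linear function of its input features, then picks the smallest vCPU
    count predicted to meet the SLO. Assumes runtime ~ work / vcpus.
    """
    n_features: int
    max_vcpus: int = 8
    lr: float = 0.5  # step size for the normalized LMS update
    weights: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def predict_work(self, x: List[float]) -> float:
        """Predicted vCPU-seconds for an input with feature vector x."""
        return sum(w * xi for w, xi in zip(self.weights, x))

    def choose_vcpus(self, x: List[float], slo_s: float) -> int:
        """Decide the allocation only once the input features are known."""
        work = self.predict_work(x)
        if work <= 0:
            # Nothing learned yet: allocate conservatively.
            return self.max_vcpus
        needed = math.ceil(work / slo_s)
        return max(1, min(self.max_vcpus, needed))

    def observe(self, x: List[float], vcpus: int, runtime_s: float) -> None:
        """Update the model from one finished invocation."""
        target = runtime_s * vcpus  # observed vCPU-seconds
        err = target - self.predict_work(x)
        norm = sum(xi * xi for xi in x) or 1.0
        self.weights = [
            w + self.lr * err * xi / norm
            for w, xi in zip(self.weights, x)
        ]
```

In this sketch the caller would extract features from the input (for example, its size), call `choose_vcpus` just before dispatching the invocation, and feed the measured runtime back through `observe`. Memory sizing and cold-start-aware placement, which the abstract also mentions, are deliberately left out.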