Serverless Cold Starts and Where to Find Them

This paper releases and analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform. Our analysis spans workloads from five data centers. We focus on cold starts and comprehensively examine the factors that influence their number and duration: trigger types, request synchronicity, runtime languages, and function resource allocations. We investigate the components of a cold start, including pod allocation time, code and dependency deployment time, and scheduling delays, and examine how they relate to runtime languages, trigger types, and resource allocation. We introduce the pod utility ratio, which measures a pod's useful lifetime relative to its cold start time and gives a more complete picture of cold starts, and we find that some pods with long cold start times also have longer useful lifetimes. Our findings reveal that the number, duration, and characteristics of cold starts have complex, multifaceted origins, driven by differences in trigger types, runtime languages, and function resource allocations. For example, cold starts in Region 1 take up to 7 seconds and are dominated by dependency deployment time and scheduling. In Region 2, cold starts take up to 3 seconds and are dominated by pod allocation time. Based on these observations, we identify opportunities to reduce the number and duration of cold starts using strategies for multi-region scheduling. Finally, we suggest directions for future research to address these challenges and enhance the performance of serverless cloud platforms. Our datasets and code are available at https://github.com/sir-lab/data-release.
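As a rough illustration of the pod utility ratio described above, the sketch below computes it for a single pod. It makes two assumptions that the abstract does not state: that the ratio is useful lifetime divided by cold start time, and that cold start time is the plain sum of its components. The class, field, and function names are hypothetical and do not reflect the schema of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRecord:
    """One pod's timings in seconds. All field names are hypothetical."""
    pod_allocation_s: float
    code_deploy_s: float
    dependency_deploy_s: float
    scheduling_delay_s: float
    useful_lifetime_s: float

    def components(self) -> dict[str, float]:
        # Cold start components named in the abstract.
        return {
            "pod_allocation_s": self.pod_allocation_s,
            "code_deploy_s": self.code_deploy_s,
            "dependency_deploy_s": self.dependency_deploy_s,
            "scheduling_delay_s": self.scheduling_delay_s,
        }

    @property
    def cold_start_s(self) -> float:
        # Assumption: the components add up to the total cold start time.
        return sum(self.components().values())

    def dominant_component(self) -> str:
        # Name of the component that contributes the most time.
        parts = self.components()
        return max(parts, key=parts.get)


def pod_utility_ratio(pod: PodRecord) -> float:
    # Assumed definition: useful lifetime relative to cold start time.
    if pod.cold_start_s <= 0:
        raise ValueError("cold start time must be positive")
    return pod.useful_lifetime_s / pod.cold_start_s


# Illustrative values only, not taken from the trace.
example = PodRecord(
    pod_allocation_s=0.5,
    code_deploy_s=1.0,
    dependency_deploy_s=4.0,
    scheduling_delay_s=1.5,
    useful_lifetime_s=700.0,
)
print(example.cold_start_s, example.dominant_component(), pod_utility_ratio(example))
```

Under this reading, a pod with a slow cold start can still score well if it stays useful for long enough, which is the effect the abstract reports for some pods.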
