DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution

Snowflake revolutionized data warehousing with an elastic architecture that decouples compute and storage, enabling scalable solutions for diverse data analytics needs. Building on this foundation, Snowflake has advanced its AI Data Cloud vision by introducing Snowpark, a managed turnkey solution that supports data engineering and AI/ML workloads using Python and other programming languages. While Snowpark's User-Defined Function (UDF) execution model offers high throughput, it is highly vulnerable to performance degradation from data skew, where uneven data partitioning causes straggler tasks and unpredictable latency. The non-uniform computational cost of arbitrary user code further exacerbates this classic challenge. This paper presents DySkew, a novel, data-skew-aware execution strategy for Snowpark UDFs. Built upon Snowflake's new generalized skew handling solution, an adaptive data distribution mechanism that uses per-link state machines, DySkew addresses the unique challenges of user-defined logic with three goals: fine-grained per-row mitigation, dynamic runtime adaptation, and low-overhead, cost-aware redistribution. Specifically, for Snowpark, we introduce crucial optimizations, including an eager redistribution strategy and a Row Size Model that dynamically manages overhead for extremely large rows. This dynamic approach replaces the previous static round-robin method and overcomes its limitations. We detail the architecture of this framework and demonstrate its effectiveness through performance evaluations and real-world case studies, which show significant improvements in execution time and resource utilization for large-scale Snowpark UDF workloads.
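To make the straggler effect concrete, the following toy simulation compares a static round-robin assignment with a simple load-aware assignment when per-row UDF cost is non-uniform. It is only an illustrative sketch: the greedy least-loaded policy, the function names, and the synthetic cost pattern are all assumptions made for this example, and it does not reproduce DySkew's per-link state machines, eager redistribution, or Row Size Model.

```python
import heapq

def round_robin(costs, n_workers):
    """Static assignment: row i goes to worker i % n_workers, ignoring its cost."""
    loads = [0.0] * n_workers
    for i, cost in enumerate(costs):
        loads[i % n_workers] += cost
    return loads

def least_loaded(costs, n_workers):
    """Load-aware assignment (hypothetical stand-in, not DySkew itself):
    each row goes to the worker with the smallest accumulated cost so far."""
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    for cost in costs:
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w))
    loads = [0.0] * n_workers
    for load, w in heap:
        loads[w] = load
    return loads

# Synthetic, assumed workload: every 4th row is 50x more expensive to process.
costs = [50.0 if i % 4 == 0 else 1.0 for i in range(400)]

rr = round_robin(costs, 4)
ll = least_loaded(costs, 4)

# The slowest worker determines the overall finish time (the straggler).
print("round-robin  slowest worker:", max(rr))
print("least-loaded slowest worker:", max(ll))
```

In this contrived case the expensive rows align with the round-robin stride, so one worker receives all of them and becomes the straggler, while the load-aware policy keeps the slowest worker close to the average load. A real system would not know row costs in advance, which is why runtime adaptation matters.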
