Data science pipelines commonly utilize dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning. The most popular tools for these tasks are pandas and NumPy. However, these tools are limited to executing on a single node, making them unsuitable for processing large-scale data. Several systems have attempted to distribute data science applications to clusters while maintaining interfaces similar to single-node libraries, enabling data scientists to scale their workloads without significant effort. However, existing systems often struggle with processing large datasets due to Out-of-Memory (OOM) problems caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework specifically designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been successfully deployed in production environments with up to 5k CPU cores. Its applications span various domains, including user behavior analysis and recommendation systems in the e-commerce sector, as well as credit assessment and risk management in the finance industry. Users can easily scale their data science workloads by simply changing the import line of their pandas and NumPy code. Our experiments demonstrate that Xorbits can effectively process very large datasets without encountering OOM or data-skewing problems. Over the fastest state-of-the-art solutions, Xorbits achieves an impressive 2.66* speedup on average. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by an impressive margin of 60 percentage points. Xorbits is available at https: //github.com/xorbitsai/xorbits.
翻译:数据科学管线通常采用数据框和数组操作来完成数据预处理、分析与机器学习等任务,其中pandas和NumPy是最常用的工具。然而,这些工具仅支持单节点执行,难以处理大规模数据。已有多个系统尝试将数据科学应用分布式部署到集群中,同时保留与单节点库相似的接口,使数据科学家无需大量额外工作即可扩展其工作负载。然而,现有系统在处理大规模数据集时,常因数据分区不当引发内存溢出(OOM)问题。为解决这些挑战,我们开发了Xorbits——一个高性能、可扩展的数据科学框架,专为在集群间分布式部署数据科学工作负载而设计,同时保留用户熟悉的API。Xorbits的核心差异化特性在于其能够动态切换图构建与图执行过程。目前Xorbits已在包含多达5000个CPU核心的生产环境中成功部署,其应用涵盖电商领域的用户行为分析与推荐系统,以及金融行业的信用评估与风险管理。用户仅需修改pandas和NumPy代码中的导入行,即可轻松扩展数据科学工作负载。实验表明,Xorbits能有效处理超大规模数据集,无内存溢出或数据倾斜问题。相较于最先进的解决方案,Xorbits平均实现2.66倍的加速比。在API覆盖率方面,Xorbits达到96.7%的兼容率,比最快框架高出60个百分点。Xorbits的开源代码已发布在https://github.com/xorbitsai/xorbits。