Data science pipelines commonly utilize dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning. The most popular tools for these tasks are pandas and NumPy. However, these tools are limited to executing on a single node, making them unsuitable for processing large-scale data. Several systems have attempted to distribute data science applications to clusters while maintaining interfaces similar to single-node libraries, enabling data scientists to scale their workloads without significant effort. However, existing systems often struggle with processing large datasets due to Out-of-Memory (OOM) problems caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework specifically designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been successfully deployed in production environments with up to 5k CPU cores. Its applications span various domains, including user behavior analysis and recommendation systems in the e-commerce sector, as well as credit assessment and risk management in the finance industry. Users can easily scale their data science workloads by simply changing the import line of their pandas and NumPy code. Our experiments demonstrate that Xorbits can effectively process very large datasets without encountering OOM or data-skewing problems. Over the fastest state-of-the-art solutions, Xorbits achieves an impressive 2.66* speedup on average. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by an impressive margin of 60 percentage points. Xorbits is available at https://github.com/xorbitsai/xorbits.
翻译:数据科学流水线普遍利用数据框和数组操作来完成数据预处理、分析及机器学习等任务。最常用的工具有pandas和NumPy,但这些工具仅限于在单节点上执行,难以处理大规模数据。部分系统尝试将数据科学应用分布到集群,同时保持与单节点库相似的接口,使数据科学家无需大量精力即可扩展其工作负载。然而,现有系统常因数据分区不佳导致的"内存溢出"(OOM)问题在处理大型数据集时陷入困境。为应对这些挑战,我们开发了Xorbits——一个高性能、可扩展的数据科学框架,专为在集群上分布式运行数据科学工作负载设计,同时保留用户熟悉的API。Xorbits的核心差异在于其能在图构建与图执行之间动态切换。该框架已在生产环境中成功部署,支持高达5000个CPU核心。其应用涵盖电子商务领域的用户行为分析和推荐系统,以及金融行业的信用评估与风险管理等多个领域。用户仅需修改pandas和NumPy代码中的导入行,即可轻松扩展其数据科学工作负载。实验表明,Xorbits能有效处理超大规模数据集而不出现OOM或数据倾斜问题。相较于最先进的解决方案,Xorbits平均实现了2.66倍的加速比。在API覆盖率方面,Xorbits达到96.7%的兼容性,比最快的框架高出60个百分点。Xorbits代码已开源,访问地址为https://github.com/xorbitsai/xorbits。