Xorbits: Automating Operator Tiling for Distributed Data Science

Data science pipelines commonly utilize dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning. The most popular tools for these tasks are pandas and NumPy. However, these tools are limited to executing on a single node, making them unsuitable for processing large-scale data. Several systems have attempted to distribute data science applications to clusters while maintaining interfaces similar to single-node libraries, enabling data scientists to scale their workloads without significant effort. However, existing systems often struggle with processing large datasets due to Out-of-Memory (OOM) problems caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework specifically designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been successfully deployed in production environments with up to 5k CPU cores. Its applications span various domains, including user behavior analysis and recommendation systems in the e-commerce sector, as well as credit assessment and risk management in the finance industry. Users can easily scale their data science workloads by simply changing the import line of their pandas and NumPy code. Our experiments demonstrate that Xorbits can effectively process very large datasets without encountering OOM or data-skewing problems. Over the fastest state-of-the-art solutions, Xorbits achieves an impressive 2.66* speedup on average. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by an impressive margin of 60 percentage points. Xorbits is available at https: //github.com/xorbitsai/xorbits.

翻译：数据科学管线通常采用数据框和数组操作来完成数据预处理、分析与机器学习等任务，其中pandas和NumPy是最常用的工具。然而，这些工具仅支持单节点执行，难以处理大规模数据。已有多个系统尝试将数据科学应用分布式部署到集群中，同时保留与单节点库相似的接口，使数据科学家无需大量额外工作即可扩展其工作负载。然而，现有系统在处理大规模数据集时，常因数据分区不当引发内存溢出（OOM）问题。为解决这些挑战，我们开发了Xorbits——一个高性能、可扩展的数据科学框架，专为在集群间分布式部署数据科学工作负载而设计，同时保留用户熟悉的API。Xorbits的核心差异化特性在于其能够动态切换图构建与图执行过程。目前Xorbits已在包含多达5000个CPU核心的生产环境中成功部署，其应用涵盖电商领域的用户行为分析与推荐系统，以及金融行业的信用评估与风险管理。用户仅需修改pandas和NumPy代码中的导入行，即可轻松扩展数据科学工作负载。实验表明，Xorbits能有效处理超大规模数据集，无内存溢出或数据倾斜问题。相较于最先进的解决方案，Xorbits平均实现2.66倍的加速比。在API覆盖率方面，Xorbits达到96.7%的兼容率，比最快框架高出60个百分点。Xorbits的开源代码已发布在https://github.com/xorbitsai/xorbits。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日