Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications

Data is everywhere, generated by sources ranging from health care and human infrastructure to a growing number of sensors and internet-connected devices. To handle this volume, the data engineering field has expanded significantly in recent years in both research and industry. Traditionally, data engineering, machine learning, and AI workloads have run on large clusters in data center environments, which require substantial investment in hardware and maintenance. With the rise of the public cloud, large applications can now run across many nodes without owning or maintaining hardware. Serverless functions such as AWS Lambda provide horizontal scaling and precise billing without the burden of managing traditional cloud infrastructure. However, when processing large datasets, users often rely on external storage options that are significantly slower than the direct communication typical of HPC clusters. We introduce Cylon, a high-performance distributed data frame solution that has shown promising results for data processing in Python. We describe how we took inspiration from the FMI library to design a serverless communicator that addresses the communication and performance issues associated with serverless functions. By implementing direct communication via NAT traversal with TCP hole punching, we demonstrate that the scaling efficiency of AWS Lambda comes within 6.5% of serverful AWS (EC2) at 64 nodes.
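To make the closing comparison concrete, the following is a minimal sketch of how a scaling-efficiency gap between two platforms could be computed. It assumes a strong-scaling definition of efficiency (speedup divided by the ideal speedup) and a relative gap between the two efficiencies; the abstract does not state which definition the authors use. The function names and all timings are hypothetical and are not results from the paper.

```python
def scaling_efficiency(t_base, n_base, t_n, n):
    """Strong-scaling efficiency of a run on n nodes relative to a baseline.

    Assumed definition (not taken from the abstract):
    efficiency = (t_base / t_n) / (n / n_base)
    """
    speedup = t_base / t_n   # measured speedup over the baseline run
    ideal = n / n_base       # speedup expected under perfect scaling
    return speedup / ideal


def relative_gap(eff_a, eff_b):
    """Relative difference of efficiency eff_a with respect to eff_b."""
    return abs(eff_a - eff_b) / eff_b


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    ec2_eff = scaling_efficiency(t_base=640.0, n_base=1, t_n=12.5, n=64)
    lambda_eff = scaling_efficiency(t_base=640.0, n_base=1, t_n=13.0, n=64)
    gap = relative_gap(lambda_eff, ec2_eff)
    print(f"EC2 efficiency:    {ec2_eff:.3f}")
    print(f"Lambda efficiency: {lambda_eff:.3f}")
    print(f"Relative gap:      {gap:.1%}")
```

Under this assumed definition, a claim of "within 6.5%" would correspond to `relative_gap` returning a value of at most 0.065 for the 64-node runs.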

