Data is generated everywhere, from health and human infrastructure to the growing number of sensors and internet-connected devices. To handle this growth, the data engineering field has expanded significantly in recent years in both research and industry. Traditionally, data engineering, machine learning, and AI workloads have run on large clusters in data center environments, requiring substantial investment in hardware and maintenance. With the rise of the public cloud, it is now possible to run large applications across many nodes without owning or maintaining hardware. Serverless functions such as AWS Lambda provide horizontal scaling and precise billing without the hassle of managing traditional cloud infrastructure. However, when processing large datasets, users often rely on external storage options, which are significantly slower than the direct communication typical of HPC clusters. We introduce Cylon, a high-performance distributed data frame solution that has shown promising results for data processing using Python. We describe how we drew on the FMI library to design a serverless communicator that addresses the communication and performance issues associated with serverless functions. With our design, which implements direct communication via NAT traversal with TCP hole punching, we demonstrate that the scaling efficiency of AWS Lambda comes within 6.5% of serverful AWS (EC2) at 64 nodes.