Cloud computing has become a major means of reproducing computational experiments. Yet two main difficulties remain in reproducing batch-based big data analytics (both descriptive and predictive) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another, a problem known as vendor lock-in. To tackle these problems, we leverage serverless computing and containerization for automated scalable execution and reproducibility, and use the adapter design pattern to make applications portable and reproducible across clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated storage of data and configuration for each execution, 3) flexible client modes based on user preferences, 4) execution history queries, and 5) simple reproduction of existing executions in the same or a different environment. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show that our toolkit achieves good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
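To illustrate how the adapter design pattern mentioned in the abstract can decouple an analytics pipeline from a specific cloud, the following is a minimal sketch under stated assumptions. All names here (`CloudAdapter`, `FakeAWSAdapter`, `FakeAzureAdapter`, `ExecutionRecord`, `execute`) are hypothetical and are not taken from the toolkit; the adapters are stand-ins that return strings rather than calling any real provider SDK.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ExecutionRecord:
    """Hypothetical record of one run: what to execute and on how many nodes."""
    command: str
    nodes: int
    log: list = field(default_factory=list)  # results appended per execution


class CloudAdapter(ABC):
    """Cloud-neutral interface (an assumption, not the toolkit's API)."""

    @abstractmethod
    def provision(self, nodes: int) -> str:
        """Create a cluster and return its identifier."""

    @abstractmethod
    def run(self, cluster_id: str, command: str) -> str:
        """Run the analytics command on the cluster and return its result."""

    @abstractmethod
    def terminate(self, cluster_id: str) -> None:
        """Release the cluster's resources."""


class FakeAWSAdapter(CloudAdapter):
    # Stand-in only: a real adapter would call the provider's SDK here.
    def provision(self, nodes: int) -> str:
        return f"aws-cluster-{nodes}"

    def run(self, cluster_id: str, command: str) -> str:
        return f"[{cluster_id}] ran: {command}"

    def terminate(self, cluster_id: str) -> None:
        pass


class FakeAzureAdapter(CloudAdapter):
    # Stand-in only: same interface, different (simulated) backend.
    def provision(self, nodes: int) -> str:
        return f"azure-cluster-{nodes}"

    def run(self, cluster_id: str, command: str) -> str:
        return f"[{cluster_id}] ran: {command}"

    def terminate(self, cluster_id: str) -> None:
        pass


def execute(adapter: CloudAdapter, record: ExecutionRecord) -> str:
    """Provision, run, and always terminate, using only the neutral interface."""
    cluster_id = adapter.provision(record.nodes)
    try:
        result = adapter.run(cluster_id, record.command)
        record.log.append(result)
        return result
    finally:
        adapter.terminate(cluster_id)


# Reproducing the same recorded execution on a different (simulated) cloud
# only requires swapping the adapter.
record = ExecutionRecord(command="spark-submit job.py", nodes=4)
first = execute(FakeAWSAdapter(), record)
second = execute(FakeAzureAdapter(), record)
```

In this sketch the pipeline logic in `execute` depends only on the abstract interface, so supporting another cloud would mean adding one more adapter class rather than changing the pipeline itself.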