Reproducible and Portable Big Data Analytics in the Cloud

Cloud computing has become a major approach for reproducing computational experiments. Yet two main difficulties remain in reproducing batch-based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, the so-called vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and use the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show that our toolkit achieves good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
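
To illustrate how an adapter design pattern can decouple an analytics pipeline from a specific cloud, the following is a minimal sketch and not the toolkit's actual code. All names here (`CloudAdapter`, `FakeAWSAdapter`, `FakeAzureAdapter`, `execute`) are hypothetical, and the adapters only record what they would do instead of calling any real cloud SDK.

```python
from abc import ABC, abstractmethod


class CloudAdapter(ABC):
    """Hypothetical cloud-neutral interface (an assumption for illustration)."""

    @abstractmethod
    def provision(self, nodes: int) -> str:
        """Create a cluster and return its identifier."""

    @abstractmethod
    def run(self, cluster_id: str, pipeline: str) -> str:
        """Run the named pipeline on the cluster and return a result."""

    @abstractmethod
    def terminate(self, cluster_id: str) -> None:
        """Release the cluster's resources."""


class _RecordingAdapter(CloudAdapter):
    """Shared stand-in logic: logs actions rather than contacting a cloud."""

    prefix = "cloud"

    def __init__(self) -> None:
        self.log: list[str] = []

    def provision(self, nodes: int) -> str:
        self.log.append(f"{self.prefix}: provision {nodes} nodes")
        return f"{self.prefix}-cluster-1"

    def run(self, cluster_id: str, pipeline: str) -> str:
        self.log.append(f"{self.prefix}: run {pipeline}")
        return f"{cluster_id} ran {pipeline}"

    def terminate(self, cluster_id: str) -> None:
        self.log.append(f"{self.prefix}: terminate {cluster_id}")


class FakeAWSAdapter(_RecordingAdapter):
    prefix = "aws"


class FakeAzureAdapter(_RecordingAdapter):
    prefix = "azure"


def execute(adapter: CloudAdapter, pipeline: str, nodes: int) -> str:
    """Provision, run, and always terminate, regardless of the cloud."""
    cluster_id = adapter.provision(nodes)
    try:
        return adapter.run(cluster_id, pipeline)
    finally:
        # Resource termination happens even if the run raises.
        adapter.terminate(cluster_id)


if __name__ == "__main__":
    for adapter in (FakeAWSAdapter(), FakeAzureAdapter()):
        print(execute(adapter, "wordcount", nodes=2))
        print(adapter.log)
```

In this sketch the same `execute` call works for either adapter, which is the property that would let an execution recorded on one cloud be repeated on another by swapping only the adapter.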

