The Cambrian explosion of new accelerators, driven by the slowdown of Moore's Law, has created significant resource management challenges for modern IaaS clouds. Unlike the homogeneous datacenters backing legacy clouds, emerging neoclouds amass a diverse portfolio of heterogeneous hardware -- NVIDIA GPUs, TPUs, Trainium chips, and FPGAs. Neocloud operators and tenants must transition from managing a single large pool of computational resources to navigating a set of highly fragmented and constrained pools. We argue that cloud resource management mechanisms and interfaces require a fundamental rethink to enable efficient and economical neoclouds. Specifically, we propose shifting from long-term static resource allocation with fixed pricing to dynamic allocation with continuous, multilateral cost renegotiation. We demonstrate that this approach is not only feasible for modern applications but also significantly improves resource efficiency and reduces costs. Finally, we propose a new architecture for the interaction between operators, tenants, and applications in neoclouds.