Public clouds increasingly expose heterogeneous hardware, but their allocation interface remains built around rigid on-demand and spot service classes. This makes it hard to satisfy time-varying tenant objectives and operator constraints in oversubscribed, heterogeneous clusters without exposing internal application or infrastructure state. We present LaissezCloud, a cloud resource management platform for continuous re-negotiation of running allocations. Unlike spot instances, which use launch-time bids and unilateral preemption, LaissezCloud keeps allocations continuously contestable during execution: tenants and operators update bids online, and a running tenant keeps a resource only as long as its bid exceeds competing demand. Pricing serves both as a narrow waist and as an incentive-alignment mechanism between mutually untrusted participants: tenants express utility through bids, while operators price in power, cooling, or carbon constraints without exposing internal telemetry. Across a diverse set of accelerator workloads, LaissezCloud reduces performance degradation under contention by 8-23% versus on-demand and spot baselines, and scales to clusters of at least 10,000 nodes.
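The continuous-contestability rule described in the abstract (a running tenant keeps a resource only while its bid exceeds competing demand, and the operator expresses its constraints through price) can be illustrated with a small sketch. This is not LaissezCloud's implementation: the `Resource` class, its method names, the single-resource scope, the modeling of operator constraints as one reserve price, and the tie-breaking rule in favor of the incumbent are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Resource:
    """One contestable resource, e.g. a single accelerator (hypothetical model)."""
    # Assumption: operator constraints (power, cooling, carbon) are folded
    # into a single reserve price that tenant bids must strictly exceed.
    reserve_price: float = 0.0
    bids: Dict[str, float] = field(default_factory=dict)  # tenant -> current bid
    holder: Optional[str] = None

    def update_bid(self, tenant: str, price: float) -> None:
        """A tenant raises or lowers its bid while the allocation is live."""
        self.bids[tenant] = price
        self._reallocate()

    def update_reserve(self, price: float) -> None:
        """The operator reprices the resource without revealing why."""
        self.reserve_price = price
        self._reallocate()

    def _reallocate(self) -> None:
        # Only bids strictly above the operator's reserve are eligible.
        eligible = {t: b for t, b in self.bids.items() if b > self.reserve_price}
        if not eligible:
            self.holder = None
            return
        best = max(eligible.values())
        # Assumption: on a tie the incumbent keeps the resource, to avoid churn.
        if self.holder in eligible and eligible[self.holder] == best:
            return
        self.holder = max(eligible, key=eligible.get)


gpu = Resource(reserve_price=1.0)
gpu.update_bid("tenant_a", 2.0)   # tenant_a acquires the resource
gpu.update_bid("tenant_b", 3.0)   # tenant_b outbids the running tenant
gpu.update_bid("tenant_a", 3.0)   # tie: incumbent tenant_b keeps it
gpu.update_reserve(5.0)           # operator reprices; no bid clears the reserve
```

In this sketch every bid or reserve update triggers a re-evaluation, which is the contrast with a launch-time bid: the holder can change at any point during execution, and the operator influences the outcome only through the price it posts.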