A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes, which route or select models for each example at inference time. In this work we study a simple scheme for adaptive inference. We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models, where ensemble agreement serves as a data-dependent routing criterion. This scheme is easy to incorporate into existing inference pipelines, requires no additional training, and can be used to place models across multiple resource tiers; for instance, serving efficient models at the edge and invoking larger models in the cloud only when necessary. In cases where parallel inference is feasible, we show that CoE can improve accuracy relative to the single best model while reducing the average cost of inference by up to 7x, and provides Pareto-dominant solutions in accuracy and efficiency relative to existing adaptive inference baselines. These savings translate to a reduction of over 3x in total monetary cost when performing inference using a heterogeneous cluster of GPUs. Finally, for edge inference scenarios where portions of the cascade reside at the edge and the rest in the cloud, CoE can provide a 14x reduction in communication cost and inference latency without sacrificing accuracy.
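The routing rule described above can be sketched in a few lines. The following is a minimal, illustrative implementation assuming unanimous agreement as the escalation criterion and majority vote as the final prediction; the actual agreement threshold and tier composition used in the paper may differ.

```python
# Minimal sketch of a cascade of ensembles (CoE) with agreement-based
# routing. The unanimous-agreement rule and the toy "models" below are
# illustrative assumptions, not the paper's exact configuration.

def coe_predict(ensembles, x):
    """Route example x through successive ensembles (cheap -> expensive).

    `ensembles` is a list of tiers; each tier is a list of callables
    mapping an input to a class label. If every model in a tier agrees,
    its prediction is returned immediately (saving the cost of later,
    larger tiers); otherwise the next tier is consulted.
    """
    prediction = None
    for tier in ensembles:
        predictions = [model(x) for model in tier]
        prediction = max(set(predictions), key=predictions.count)  # majority vote
        if len(set(predictions)) == 1:  # unanimous agreement: stop early
            return prediction
    return prediction  # fall back to the final tier's majority vote

# Toy usage: two tiers of trivial constant-style classifiers.
small_tier = [lambda x: x % 2, lambda x: x % 2]        # cheap models, agree
large_tier = [lambda x: (x + 1) % 2, lambda x: x % 2]  # expensive fallback
print(coe_predict([small_tier, large_tier], 4))  # small tier agrees -> 0
```

In a deployed setting, the cheap tier would run at the edge and the expensive tier would be a cloud endpoint invoked only on disagreement, which is what yields the communication and latency savings reported above.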