The rapid deployment of machine learning across platforms, from milliwatt-class TinyML devices to datacenter-scale large language models (LLMs), has made energy efficiency a primary constraint for sustainable AI. At every scale, performance and energy are increasingly limited by data movement and memory-system behavior rather than by arithmetic throughput alone. This work reviews energy-efficient software-hardware co-design methods that span edge inference and training through datacenter-scale LLM serving. It covers accelerator architectures (e.g., ASIC/FPGA dataflows and processing-/compute-in-memory designs) and system-level techniques (e.g., partitioning, quantization, scheduling, and runtime adaptation). We distill common design levers and trade-offs and highlight recurring gaps, including limited cross-platform generalization, large and costly co-design search spaces, and inconsistent benchmarking across workloads and deployment settings. Finally, we outline a hierarchical decomposition perspective that maps optimization strategies to computational roles and supports incremental adaptation, offering practical guidance for building energy- and carbon-aware ML systems.