GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization estimation remains comparatively understudied and is itself complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates the estimators against unseen real-world models. Significant challenges remain: analytical models generalize poorly, cannot easily be extended to new GPU architectures, and cannot accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse architectures and hardware variants. We release all datasets, tools, and artifacts to support further research.
