Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation -- a key proxy for contention -- enables interference-aware scheduling. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization estimation remains comparatively understudied and is further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates the estimators on unseen real-world models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect the savings from memory optimizations; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse architectures and hardware variants. We release all datasets, tools, and artifacts to support further research.