Collocating deep learning training tasks improves GPU utilization but can cause drastic slowdowns due to resource contention and risks Out-of-Memory (OOM) failures. Accurate memory estimation is essential for robust collocation, while GPU utilization -- a key proxy for resource contention -- enables interference-aware scheduling to reduce slowdowns and improve throughput. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit markedly different memory footprints across hardware generations. Utilization estimation remains comparatively understudied and is further complicated by the non-additive nature of utilization metrics and their sensitivity to hardware. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and practical overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, and train MLP- and Transformer-based estimators for memory prediction. We further experiment with utilization estimation on the same dataset. Our evaluation reveals key tradeoffs and validates the estimators on unseen real-world models. Significant challenges remain: analytical models are hardware-dependent, CPU-side libraries impose intrusive integration costs, and ML-based estimators struggle with cross-architecture generalization. We release all datasets, tools, and artifacts to support further research.
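The CPU-side library paradigm can be illustrated with PyTorch's FakeTensorMode, which propagates tensor shapes and dtypes without allocating device memory. The sketch below is a minimal illustration only, not the estimation procedure evaluated in this work: the `estimate_bytes` helper, the toy `nn.Linear` workload, and the simple parameter-plus-output summation are assumptions for demonstration, and a realistic peak-memory estimate would also need to cover gradients, optimizer state, and intermediate activations.

```python
# Minimal sketch (assumed, not the paper's method): sizing a model's
# parameters and final activation on CPU with PyTorch's FakeTensorMode,
# so no real GPU memory is allocated during estimation.
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode


def estimate_bytes(model_fn, input_shape):
    """Return (parameter bytes, output-activation bytes) for a model factory."""
    with FakeTensorMode():
        model = model_fn()               # weights are created as fake tensors
        x = torch.randn(*input_shape)    # fake input, metadata only
        out = model(x)                   # shape/dtype propagation, no compute
        param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
        act_bytes = out.nelement() * out.element_size()
    return param_bytes, act_bytes


if __name__ == "__main__":
    # Hypothetical toy workload for illustration.
    params, acts = estimate_bytes(lambda: nn.Linear(4096, 4096), (32, 4096))
    print(f"params: {params / 2**20:.1f} MiB, last activation: {acts / 2**20:.1f} MiB")
```

Because all tensors under the mode carry only metadata, this kind of probe runs on the CPU in milliseconds, which is the main appeal of the library-based paradigm despite its intrusive integration cost.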
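As a rough illustration of the ML-based paradigm, the sketch below trains a small MLP regressor that maps architectural features to peak GPU memory. The feature set (batch size, parameter count, layer count, hidden width, sequence length), network sizes, loss, and training loop are hypothetical placeholders and do not reproduce the estimators or the synthetic dataset described above.

```python
# Illustrative sketch (assumed feature set and hyperparameters): a small MLP
# regressor predicting peak GPU memory (MiB) from architectural features.
import torch
import torch.nn as nn


class MemoryEstimatorMLP(nn.Module):
    """Maps a per-task feature vector to a scalar peak-memory prediction."""

    def __init__(self, num_features: int = 5, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train_estimator(features: torch.Tensor, peak_mib: torch.Tensor, epochs: int = 200):
    """Fit the regressor on (features, measured peak memory) pairs."""
    model = MemoryEstimatorMLP(num_features=features.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()  # MAE keeps large-footprint outliers from dominating
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), peak_mib)
        loss.backward()
        opt.step()
    return model
```

The same regression setup could in principle be pointed at measured utilization instead of memory, though, as noted above, the non-additive nature of utilization metrics and their hardware sensitivity make that target considerably harder.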