Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platform-as-a-service offerings rely on virtualization, which leaves operators with little insight into users' workloads. This lack of visibility hinders resource optimization by the operator, which is essential for ensuring cost efficiency and minimizing execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose System-X, which takes a \emph{hardware-centric} approach: it relies only on hardware signals, which are fully accessible to operators. Using low-level signals collected from the system, System-X detects anomalies through an unsupervised learning pipeline. We developed the pipeline by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using System-X, we identified both network and system configuration issues, which allowed us to accelerate the DeepSeek model by 5.97%.