Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments

Knowledge · Optimizer · Learning · MLOps · ML

December 2, 2024


Grigori Fursin

This white paper introduces my educational community initiative to learn how to run AI, ML, and other emerging workloads in the most efficient and cost-effective way across diverse models, data sets, software, and hardware. The project builds on Collective Mind (CM), virtualized MLOps and DevOps (CM4MLOps), MLPerf benchmarks, and the Collective Knowledge playground (CK), which I have developed in collaboration with the community and MLCommons. I created Collective Mind as a small, portable Python package with minimal dependencies, a unified CLI, and a Python API to help researchers and engineers automate repetitive, tedious, and time-consuming tasks. I also designed CM as a distributed framework that the community continuously enhances through the CM4* repositories, which serve as a unified interface for organizing and managing collections of automations and artifacts. For example, the CM4MLOps repository includes many automations, also known as CM scripts, that streamline building, running, benchmarking, and optimizing AI, ML, and other workflows across ever-evolving models, data, and systems. I donated CK, CM, and CM4MLOps to MLCommons to foster collaboration between academia and industry on co-designing more efficient and cost-effective AI systems while capturing and encoding knowledge within Collective Mind, protecting intellectual property, enabling portable skills, and accelerating the transition of state-of-the-art research into production. My ultimate goal is to collaborate with the community to complete my two-decade journey toward self-optimizing software and hardware that can automatically learn how to run any workload in the most efficient and cost-effective manner, based on user requirements and constraints such as cost, latency, throughput, accuracy, power consumption, size, and other critical factors.
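To make the stated goal more concrete, the following is a minimal sketch of choosing how to run a workload under user constraints. It is not part of CM, CM4MLOps, or MLPerf: the `Result` record, the `select_config` helper, the cost-per-query criterion, and every number in `RESULTS` are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Result:
    """One measured run of a workload. Fields and values are hypothetical."""
    name: str
    cost_usd_per_hour: float
    latency_ms: float
    throughput_qps: float
    accuracy: float


def select_config(results: Iterable[Result],
                  max_latency_ms: float,
                  min_accuracy: float) -> Optional[Result]:
    """Return the feasible result with the lowest cost per query, or None."""
    # Keep only the runs that satisfy the user's latency and accuracy constraints.
    feasible = [r for r in results
                if r.latency_ms <= max_latency_ms and r.accuracy >= min_accuracy]
    if not feasible:
        return None
    # Cost per query = hourly cost / queries served per hour.
    return min(feasible,
               key=lambda r: r.cost_usd_per_hour / (r.throughput_qps * 3600))


# Made-up placeholder measurements, not real benchmark results.
RESULTS = [
    Result("config-a", cost_usd_per_hour=1.0, latency_ms=40.0,
           throughput_qps=100.0, accuracy=0.76),
    Result("config-b", cost_usd_per_hour=3.0, latency_ms=10.0,
           throughput_qps=600.0, accuracy=0.76),
    Result("config-c", cost_usd_per_hour=0.5, latency_ms=80.0,
           throughput_qps=20.0, accuracy=0.70),
]

best = select_config(RESULTS, max_latency_ms=50.0, min_accuracy=0.75)
print(best.name if best else "no feasible configuration")
```

In this sketch the hard requirements (latency, accuracy) act as filters, and a single objective (cost per query) ranks what remains. A real system would likely need more constraints, such as power consumption and size, and would obtain its measurements from actual benchmark runs rather than a fixed list.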
