Machine Learning (ML) is profoundly reshaping the way researchers create, implement, and operate data-intensive software. Its adoption, however, introduces notable challenges for computing infrastructures, particularly in coordinating access to hardware accelerators across development, testing, and production environments. The INFN initiative AI_INFN (Artificial Intelligence at INFN) seeks to promote the use of ML methods across the variety of INFN research scenarios by offering comprehensive technical support, including access to AI-oriented computational resources. Leveraging the INFN Cloud ecosystem and cloud-native technologies, the project emphasizes efficient sharing of accelerator hardware without compromising the breadth of the Institute's research activities. This contribution describes the deployment and commissioning of a Kubernetes-based platform designed to simplify GPU-powered data analysis workflows and to enable their scalable execution on heterogeneous distributed resources. By integrating offloading mechanisms based on Virtual Kubelet and the InterLink API, the platform allows workflows to span multiple resource providers, from Worldwide LHC Computing Grid sites to high-performance computing centers such as CINECA Leonardo. We present preliminary benchmarks, functional tests, and case studies demonstrating both performance and integration outcomes.