Machine Learning (ML) is profoundly reshaping the way researchers create, implement, and operate data-intensive software. Its adoption, however, introduces notable challenges for computing infrastructures, particularly in coordinating access to hardware accelerators across development, testing, and production environments. The INFN initiative AI_INFN (Artificial Intelligence at INFN) seeks to promote the use of ML methods across the variety of INFN research scenarios by offering comprehensive technical support, including access to AI-oriented computational resources. Leveraging the INFN Cloud ecosystem and cloud-native technologies, the project emphasizes efficient sharing of accelerator hardware without compromising the breadth of the Institute's research activities. This contribution describes the deployment and commissioning of a Kubernetes-based platform designed to simplify GPU-powered data analysis workflows and to enable their scalable execution on heterogeneous distributed resources. By integrating offloading mechanisms based on Virtual Kubelet and the InterLink API, the platform allows workflows to span multiple resource providers, from Worldwide LHC Computing Grid sites to high-performance computing centers such as CINECA Leonardo. We present preliminary benchmarks, functional tests, and case studies demonstrating both performance and integration outcomes.