We present a framework for dynamically managing structured parallel processing skeletons on serverless platforms. Our goal is to bring HPC-like performance and resilience to serverless and continuum environments while preserving the programmability benefits of skeletons. As a first step, we focus on the well-known Farm pattern and its implementation on the open-source OpenFaaS platform, treating autoscaling of the worker pool as a QoS-aware resource management problem. The framework couples a reusable farm template with a Gymnasium-based monitoring and control layer that exposes queue, timing, and QoS metrics to both reactive and learning-based controllers. We investigate how effectively AI-driven dynamic scaling can manage the farm's degree of parallelism by scaling the serverless functions on OpenFaaS. In particular, we discuss the autoscaling model and its training, and evaluate two reinforcement learning (RL) policies against a reactive management baseline derived from a simple farm performance model. Our results show that AI-based management accommodates platform-specific limitations better than purely model-based performance steering, improving QoS while maintaining efficient resource usage and stable scaling behaviour.
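To make the idea of a reactive baseline derived from a farm performance model concrete, the sketch below uses the textbook farm model, in which the worker count that matches the arrival rate is the ratio of per-task service time to task inter-arrival time, rounded up. This is an assumption for illustration only: the abstract does not state which model the baseline uses, and the function name, parameters, and worker bounds are hypothetical.

```python
import math


def reactive_worker_count(service_time_s: float,
                          interarrival_time_s: float,
                          min_workers: int = 1,
                          max_workers: int = 20) -> int:
    """Suggest a farm worker count from the textbook farm model.

    Assumption (not taken from the paper): the farm keeps up with its input
    stream when workers >= service_time / interarrival_time.
    The bounds min_workers and max_workers are hypothetical platform limits.
    """
    if service_time_s <= 0 or interarrival_time_s <= 0:
        raise ValueError("service and inter-arrival times must be positive")
    # Smallest integer number of workers whose combined rate matches arrivals.
    ideal = math.ceil(service_time_s / interarrival_time_s)
    # Clamp to the range the platform is allowed to scale within.
    return max(min_workers, min(max_workers, ideal))


if __name__ == "__main__":
    # Tasks take 2 s each and arrive every 0.5 s.
    print(reactive_worker_count(2.0, 0.5))
```

A controller of this kind reacts only to measured timings. It has no notion of platform effects such as cold starts or scaling delays, which is the sort of limitation a learned policy might account for.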