Reinforcement Learning-Based Dynamic Management of Structured Parallel Farm Skeletons on Serverless Platforms

We present a framework for dynamic management of structured parallel processing skeletons on serverless platforms. Our goal is to bring HPC-like performance and resilience to serverless and continuum environments while preserving the programmability benefits of skeletons. As a first step, we focus on the well-known Farm pattern and its implementation on the open-source OpenFaaS platform, treating autoscaling of the worker pool as a QoS-aware resource management problem. The framework couples a reusable farm template with a Gymnasium-based monitoring and control layer that exposes queue, timing, and QoS metrics to both reactive and learning-based controllers. We investigate how effectively AI-driven dynamic scaling manages the farm's degree of parallelism by scaling the underlying serverless functions on OpenFaaS. In particular, we discuss the autoscaling model and its training, and evaluate two reinforcement learning (RL) policies against a reactive baseline derived from a simple farm performance model. Our results show that AI-based management accommodates platform-specific limitations better than purely model-based performance steering, improving QoS while maintaining efficient resource usage and stable scaling behaviour.
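
To make the idea of a reactive baseline derived from a simple farm performance model concrete, the sketch below uses the classic farm model from the structured parallel programming literature. The abstract does not state the exact model used, so this is an assumption. With emitter time T_e, per-task worker time T_w, collector time T_c and n workers, the ideal farm service time is max(T_e, T_w / n, T_c). To sustain an inter-arrival time T_a, the farm therefore needs roughly n = ceil(T_w / T_a) workers, provided T_e and T_c stay below T_a. All names, the clamping bounds and the scale-down margin are hypothetical. This is not the authors' controller, and it only computes a replica count; it does not call any OpenFaaS API.

```python
import math
from dataclasses import dataclass


def farm_service_time(n: int, t_emitter: float, t_worker: float, t_collector: float) -> float:
    """Ideal farm service time with n workers: the slowest stage bounds throughput."""
    if n < 1:
        raise ValueError("a farm needs at least one worker")
    return max(t_emitter, t_worker / n, t_collector)


def target_workers(t_arrival: float, t_worker: float) -> int:
    """Smallest n with t_worker / n <= t_arrival, i.e. the workers keep up with arrivals."""
    if t_arrival <= 0 or t_worker <= 0:
        raise ValueError("times must be positive")
    # Subtract a tiny tolerance so floating-point round-off just above an
    # integer ratio does not request a spurious extra worker.
    return max(1, math.ceil(t_worker / t_arrival - 1e-9))


@dataclass
class ReactiveFarmController:
    """Hypothetical reactive baseline: re-derives the worker count from measured times."""
    n_min: int = 1
    n_max: int = 32
    # Shrink only when the target is at least this many workers below the current count.
    scale_down_margin: int = 2

    def decide(self, current: int, t_arrival: float, t_worker: float) -> int:
        desired = min(self.n_max, max(self.n_min, target_workers(t_arrival, t_worker)))
        # Scale up immediately; ignore small scale-down requests to avoid oscillation.
        if desired < current and current - desired < self.scale_down_margin:
            return current
        return desired


if __name__ == "__main__":
    ctrl = ReactiveFarmController(n_min=1, n_max=16)
    # Tasks arrive every 0.25 s and take 2.0 s on one worker, so 8 workers are needed.
    print(ctrl.decide(current=4, t_arrival=0.25, t_worker=2.0))  # prints 8
```

In a deployment, the returned count would presumably be applied as the replica count of the worker function. The asymmetric rule (grow at once, shrink only past a margin) is one simple way to trade a little over-provisioning for steadier scaling. A purely model-based rule like this one cannot see platform-specific effects such as cold starts or replica limits, which is the gap the learned policies are meant to address.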
