Jobs on high-performance computing (HPC) clusters can suffer significant performance degradation due to inter-job network interference. The topology-aware job allocation problem (TJAP) is the problem of deciding how to dedicate nodes to specific applications so as to mitigate this interference. In this paper, we study the window-based TJAP on a fat-tree network, aiming to minimize the communication hop cost, a metric we define for inter-job interference. The window-based scheduling approach periodically takes the jobs in the queue and solves an assignment problem that maps them to the available nodes. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the DCAS, we propose the neural simulated annealing algorithm (NSA), an extension of simulated annealing (SA) that learns a repair operator and employs it in a guided heuristic search. The efficacy of NSA is demonstrated in a computational study against SA and SCIP. The numerical results indicate that both the proposed model and the proposed algorithm are effective.