Parallel programming is central to HPC and AI, but producing code that is both correct and fast remains challenging, especially for OpenMP GPU offload, where data movement and tuning dominate performance. Autonomous coding agents can compile, test, and profile on target hardware, but their outputs are brittle without domain scaffolding. We present ParaCodex, an HPC-engineer workflow that turns a Codex-based agent into an autonomous OpenMP GPU offload system through staged hotspot analysis, explicit data planning, correctness gating, and profiling-guided refinement. We evaluate translation of serial CPU kernels to OpenMP GPU offload on HeCBench, Rodinia, and NAS. After excluding five invalid kernels, ParaCodex succeeded on all 31 remaining kernels. The generated kernels reduced GPU time relative to reference OpenMP implementations in 25 of 31 cases, achieving geometric-mean speedups of 3x on HeCBench and 5x on Rodinia, and outperformed a zero-shot Codex baseline on all suites. We also evaluate CUDA-to-OpenMP offload translation on ParEval, where ParaCodex maintains high compilation and validation rates in both code-only and end-to-end settings.
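For readers unfamiliar with the target idiom, the sketch below illustrates the kind of OpenMP GPU offload kernel the abstract refers to, including the explicit map clauses that make up a data plan and a simple result check of the sort a correctness gate relies on. It is a minimal hand-written example under assumed names (saxpy, the problem size, the map clauses), not output of ParaCodex or code from the paper.

```c
// Minimal illustrative sketch (assumed example, not ParaCodex output):
// a serial saxpy loop offloaded to the GPU with an explicit data plan.
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, const float *x, float *y) {
    // Explicit data plan: x is copied to the device only (to:),
    // y is copied in and back out (tofrom:). Choosing these map()
    // clauses well is exactly the data-movement decision that
    // dominates OpenMP offload performance.
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);

    // Correctness gate in miniature: compare against the known answer.
    printf("y[0] = %f (expected 5.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
}
```

Compiled with an offload-capable toolchain (e.g. clang with -fopenmp and an -fopenmp-targets flag for the installed GPU backend), the same source also builds and runs correctly on the host, which is what makes compile-test-profile loops on target hardware practical.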