Cross-architecture GPU code transpilation is essential for unlocking low-level hardware portability, yet no scalable solution exists. We introduce CASS, the first dataset and model suite for source- and assembly-level GPU translation (CUDA <--> HIP, SASS <--> RDNA3). CASS contains 60k verified host-device code pairs, enabling learning-based translation across both ISA and runtime boundaries. We generate each sample using our automated pipeline that scrapes, translates, compiles, and aligns GPU programs across vendor stacks. Leveraging CASS, we train a suite of domain-specific translation models that achieve 88.2% accuracy on CUDA -> HIP and 69.1% on SASS -> RDNA3, outperforming commercial baselines including GPT-5.1, Claude-4.5, and Hipify by wide margins. Generated code matches native performance in 85% of cases, preserving both runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 18 GPU domains with ground-truth execution. All data, models, and evaluation tools will be released as open source to support progress in GPU compiler tooling, binary compatibility, and LLM-guided code translation.
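To make the source-level CUDA -> HIP task concrete, the sketch below shows the kind of rule-based API renaming that a purely syntactic baseline performs. It is not the CASS pipeline or model, and it is not how Hipify is implemented; the rename table `CUDA_TO_HIP` and the function `naive_cuda_to_hip` are hypothetical names introduced only for illustration, and the table covers only a handful of runtime calls whose HIP counterparts are well known.

```python
import re

# Hypothetical, deliberately tiny rename table (an assumption for illustration).
# Real porting tools cover far more of the CUDA runtime and library APIs.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Try longer names first and require word boundaries, so identifiers that are
# not in the table (e.g. cudaMemcpyAsync) are left untouched, not half-renamed.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def naive_cuda_to_hip(source: str) -> str:
    """Rename known CUDA runtime identifiers to their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


if __name__ == "__main__":
    cuda_snippet = (
        "#include <cuda_runtime.h>\n"
        "cudaMalloc(&d_x, bytes);\n"
        "cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);\n"
        "cudaFree(d_x);\n"
    )
    print(naive_cuda_to_hip(cuda_snippet))
```

Token-level renaming of this kind only addresses the source-level direction, and only for calls with a one-to-one counterpart. It does not apply to the assembly-level direction (SASS -> RDNA3), where the two instruction sets differ.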