VIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows that coding agents can generate a coherent deep learning runtime spanning from language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on three small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.