Scylla: Translating an Applicative Subset of C to Safe Rust

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantage of porting existing codebases to a memory-safe language. We instead advocate for a different approach, where the programmer iterates on the original C, gradually making the code more structured until it becomes eligible for compilation to safe Rust. This means that redesigns and rewrites can be evaluated incrementally for performance and correctness against existing test suites and production environments. Compiling structured C to safe Rust relies on the following contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees", which allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers which borrows need to be mutable; and a compilation strategy for C pointer types that is compatible with Rust's distinction between non-owned and owned allocations. We evaluate our approach on real-world cryptographic libraries, binary parsers and serializers, and a file compression library. We show that these can be rewritten to Rust with small refactors of the original C code, and that the resulting Rust code exhibits performance characteristics similar to those of the original C code. As part of our translation process, we also identify and report undefined behaviors in the bzip2 compression library and in Microsoft's implementation of the FrodoKEM cryptographic primitive.
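To make the idea of expressing pointer arithmetic with slices and splitting operations concrete, here is a minimal hand-written sketch. It is not output of the tool described above and does not implement the split-tree analysis; the function names `add_halves` and `sum` and the C snippet in the comment are hypothetical, chosen only for illustration. It assumes a C function that derives a second pointer as `buf + n` and then writes through the first pointer while reading through the second.

```rust
// Hypothetical C input (assumed for illustration only):
//
//   void add_halves(uint32_t *buf, size_t n) {
//     uint32_t *lo = buf;
//     uint32_t *hi = buf + n;          /* pointer arithmetic */
//     for (size_t i = 0; i < n; i++)
//       lo[i] += hi[i];
//   }

/// Safe Rust counterpart: the pointer offset `buf + n` becomes a split of
/// the slice at index `n`, which yields two non-overlapping borrows.
/// Panics if `buf` holds fewer than `2 * n` elements, where the C version
/// would instead read or write out of bounds.
pub fn add_halves(buf: &mut [u32], n: usize) {
    let (lo, hi) = buf.split_at_mut(n);
    for i in 0..n {
        // Unsigned addition wraps in C, so use wrapping_add here.
        lo[i] = lo[i].wrapping_add(hi[i]);
    }
}

/// A function that only reads through its pointer needs no mutable borrow,
/// so a shared slice suffices. Deciding this per borrow is the kind of
/// question a mutability inference has to answer.
pub fn sum(buf: &[u32]) -> u32 {
    buf.iter().fold(0u32, |acc, x| acc.wrapping_add(*x))
}
```

Calling `add_halves(&mut v, 3)` on a six-element vector adds the last three elements into the first three, and `sum(&v)` can then be called with a shared borrow of the same vector. The point of the sketch is that both functions compile without any `unsafe` block, because the split gives the borrow checker proof that `lo` and `hi` do not alias.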
