Many real-world software tasks require exact transcription of provided data into code: cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while still producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark that isolates this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, the evaluation protocol based on exact-string inclusion, and the analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.
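As a rough illustration of the exact-string-inclusion protocol, the sketch below checks whether every provided constant appears verbatim in a generated program. The helper name `passes_exact_inclusion` and the toy constants are illustrative assumptions, not the benchmark's actual harness.

```python
# Minimal sketch of an exact-string-inclusion check, assuming constants are
# provided as decimal strings and the model's output is a single code string.

def passes_exact_inclusion(generated_code: str, constants: list[str]) -> bool:
    """Return True only if every constant appears verbatim in the generated code.

    Constants are compared as raw decimal strings, so any rounding,
    truncation, or reformatting by the model counts as a failure.
    """
    return all(c in generated_code for c in constants)


if __name__ == "__main__":
    constants = ["3.14159265358979323846", "2.71828182845904523536"]
    candidate = (
        "values = [3.14159265358979323846, 2.71828182845904523536]\n"
        "print(sum(values))\n"
    )
    print(passes_exact_inclusion(candidate, constants))  # True for this toy example
```

A string-level check of this kind deliberately ignores whether the program runs; it isolates data integrity, which is the failure mode the benchmark targets.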