Many real-world software tasks require exact transcription of provided data into code: cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while still producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark that isolates this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, the evaluation protocol based on exact-string inclusion, and the analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.
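As a rough illustration of the exact-string-inclusion protocol, the sketch below checks whether every provided constant appears verbatim in a generated program. The helper name `passes_exact_inclusion` and the toy constants are illustrative assumptions, not the benchmark's actual harness.

```python
# Minimal sketch of an exact-string-inclusion check, assuming constants are
# provided as decimal strings and the model's output is a single code string.

def passes_exact_inclusion(generated_code: str, constants: list[str]) -> bool:
    """Return True only if every constant appears verbatim in the generated code.

    Constants are compared as raw decimal strings, so any rounding,
    truncation, or reformatting by the model counts as a failure.
    """
    return all(c in generated_code for c in constants)


if __name__ == "__main__":
    constants = ["3.14159265358979323846", "2.71828182845904523536"]
    candidate = (
        "values = [3.14159265358979323846, 2.71828182845904523536]\n"
        "print(sum(values))\n"
    )
    print(passes_exact_inclusion(candidate, constants))  # True for this toy example
```

A string-level check of this kind deliberately ignores whether the program runs; it isolates data integrity, which is the failure mode the benchmark targets.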