This report investigates whether augmenting the training data with generated code-comment pairs improves the accuracy of a binary classification model for code comment quality. The dataset comprises 9048 code-comment pairs written in the C programming language, each annotated as "Useful" or "Not Useful." Additional code-comment pairs are generated using a Large Language Model architecture and are labeled to indicate their usefulness. The work produces two classification models: one trained on the original dataset, and one trained on the augmented dataset, which adds the newly generated code-comment pairs and their labels.
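The augmentation step described above can be sketched as follows. This is a minimal illustration, not the report's actual pipeline: the `CommentPair` record, its `source` field, the `build_augmented` helper, and the sample pairs are all hypothetical, and dropping exact duplicates is an assumption rather than something the report states.

```python
from dataclasses import dataclass
from typing import List

# The two annotation classes named in the report.
VALID_LABELS = {"Useful", "Not Useful"}


@dataclass(frozen=True)
class CommentPair:
    """One labeled code-comment pair (hypothetical record layout)."""
    code: str
    comment: str
    label: str   # "Useful" or "Not Useful"
    source: str  # "original" or "generated" (assumed provenance field)


def build_augmented(original: List[CommentPair],
                    generated: List[CommentPair]) -> List[CommentPair]:
    """Concatenate original and generated pairs into one training set.

    Pairs with an unknown label are skipped, and exact (code, comment)
    duplicates are dropped so a generated pair cannot repeat an original.
    Both filters are assumptions made for this sketch.
    """
    seen = set()
    augmented = []
    for pair in [*original, *generated]:
        if pair.label not in VALID_LABELS:
            continue
        key = (pair.code, pair.comment)
        if key in seen:
            continue
        seen.add(key)
        augmented.append(pair)
    return augmented


# Illustrative data only; these pairs are not taken from the dataset.
original = [
    CommentPair("int i = 0;", "/* loop counter */", "Useful", "original"),
    CommentPair("i++;", "/* increment i */", "Not Useful", "original"),
]
generated = [
    CommentPair("free(buf);", "/* release buffer allocated in init() */",
                "Useful", "generated"),
    # Exact duplicate of an original pair; removed by build_augmented.
    CommentPair("i++;", "/* increment i */", "Not Useful", "generated"),
]

augmented = build_augmented(original, generated)
```

Under this sketch, the first model would be trained on `original` alone and the second on `augmented`, so that any difference in accuracy can be attributed to the generated pairs.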