We present a neuro-symbolic (NeSy) workflow that combines a symbolic learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show that generating controlled synthetic data with this workflow mitigates some notable weaknesses of LLM-based generation and improves the performance of classical machine learning models on the code comment classification task. Our best model, a neural network, achieves a Macro-F1 score of 91.412% after data augmentation, an increase of 1.033%.