Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination, where benchmark data inadvertently leaks into pre-training or fine-tuning datasets, raise questions about the validity of these evaluations. While this issue is well known and already limits the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze the state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM) using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), both as baseline models and after fine-tuning (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs between code quality and fairness (i.e., reducing contamination toward unbiased benchmarking).
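For context, Min-K% Prob flags a benchmark sample as likely seen during training by scoring the average log-probability of its least likely tokens under the model: memorized text tends to lack very low-probability outliers. Below is a minimal sketch of that scoring, assuming a Hugging Face causal LM; the model name, k value, and usage snippet are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of Min-K% Prob scoring (Shi et al.); assumes a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens of `text`.

    Higher scores suggest the text may have appeared in the training data.
    """
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab)
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Keep only the k% lowest-probability tokens and average them.
    n_keep = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()

# Hypothetical usage on one benchmark problem (model choice is illustrative):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# score = min_k_prob_score(verilog_prompt_plus_reference_solution, lm, tok)
```

In practice, scores computed this way are compared against a threshold (or against scores on provably unseen text) to decide whether a benchmark item is likely contaminated.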