Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, because commonly used metrics such as BLEU measure only syntactic similarity and disregard program semantics. We propose a novel evaluation methodology for code translation tasks that emphasizes semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to this new domain, and we use it to assess an LLM fine-tuned for binary lifting (i.e., decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores correlate at most weakly with semantic correctness (r = -0.127 to 0.354), indicating that syntactic metrics fail to predict functional accuracy.
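As a minimal sketch of the two quantities named above, the semantic correctness score can be computed as a simple pass rate over per-translation execution outcomes, and its relationship to BLEU can be summarized with a Pearson correlation coefficient. The function names and the sample values below are hypothetical illustrations, not the evaluation harness or the data used in this work, and the abstract does not state which correlation coefficient is reported, so Pearson's r is an assumption.

```python
from math import sqrt


def semantic_correctness(outcomes):
    """Proportion of translations whose execution outcome was correct.

    `outcomes` is a sequence of booleans, one per translation
    (hypothetical input format, assumed for this sketch).
    """
    if not outcomes:
        raise ValueError("no translations to score")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-translation results: did the translated program
# produce the expected execution outcome, and what BLEU did it get?
passed = [True, False, True, True]
bleu = [0.90, 0.80, 0.30, 0.50]

score = semantic_correctness(passed)  # 3 of 4 correct -> 0.75
r = pearson_r(bleu, [1.0 if ok else 0.0 for ok in passed])
print(f"semantic correctness = {score:.2f}, BLEU correlation r = {r:.3f}")
```

Treating each outcome as 0 or 1 keeps the score directly interpretable as a fraction of functionally correct translations, which is the property that a string-similarity metric such as BLEU does not provide.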