Automatic program translation has enormous application value and has therefore attracted significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly when the target language lacks syntax elements that are present in the source language. Metrics such as BLEU, CodeBLEU, and computational accuracy may not expose these issues. In this paper, we introduce new metrics for programming language translation that address these basic syntax errors. We develop a novel active defect probing suite called Syntactic Unit Tests (SUT), which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments show that even powerful models such as ChatGPT still make mistakes on these basic unit tests. Specifically, compared with a previous program translation evaluation dataset, its pass rate on our unit tests is 26.15% lower. Furthermore, our evaluation harness reveals the syntactic elements on which these models exhibit deficiencies.
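The abstract does not say how the harness computes its scores. As a rough illustration only, the sketch below assumes that each syntactic unit test is tagged with a single syntax element and yields a pass or fail outcome, and it reports an overall pass rate together with a per-element breakdown. The names `SutOutcome` and `pass_rates`, the element labels, and the sample outcomes are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SutOutcome:
    # Hypothetical record: the outcome of one translated unit test.
    syntax_element: str  # illustrative label, e.g. "ternary_operator"
    passed: bool


def pass_rates(outcomes):
    """Return (overall pass rate, {syntax element: pass rate})."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    totals = defaultdict(int)
    passes = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.syntax_element] += 1
        passes[outcome.syntax_element] += int(outcome.passed)
    overall = sum(passes.values()) / len(outcomes)
    per_element = {name: passes[name] / totals[name] for name in totals}
    return overall, per_element


# Made-up outcomes for illustration; these are not results from the paper.
demo = [
    SutOutcome("ternary_operator", True),
    SutOutcome("ternary_operator", False),
    SutOutcome("do_while_loop", False),
    SutOutcome("list_comprehension", True),
]
overall, per_element = pass_rates(demo)
```

Grouping by syntax element is one way such a harness could stay interpretable: a low rate for a single element points at a specific construct, which an aggregate score alone would hide.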