This paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequences (grammar patterns). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This training specializes the tagger to recognize the unique structure of the natural language that developers use to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with that of a previous version of the tagger, as well as with that of a modern off-the-shelf part-of-speech tagger, to show how it improves on other taggers' output when annotating identifiers. The code is available on GitHub.
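To make the idea of a grammar pattern concrete, the following is a minimal sketch of training a scikit-learn GradientBoostingClassifier to tag each word of an identifier. It is not SCALAR's implementation: the helper names (`split_identifier`, `word_features`, `train`, `grammar_pattern`), the feature set, the tag labels (V, N, NM, NPL, P), and the eight-entry `TOY_ORACLE` are all assumptions made for illustration, and a toy oracle of this size says nothing about the accuracy of the real tool.

```python
import re

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def word_features(words, i):
    """Hypothetical per-word features; SCALAR's real feature set may differ."""
    word = words[i]
    return {
        "word": word,                      # the word itself (one-hot encoded)
        "suffix2": word[-2:],              # last two characters
        "position": i,                     # index of the word in the identifier
        "is_last": int(i == len(words) - 1),
        "identifier_length": len(words),   # number of words in the identifier
    }


# Toy stand-in for a manually curated oracle: (identifier, one tag per word).
TOY_ORACLE = [
    ("getUserName", ["V", "NM", "N"]),
    ("setFileName", ["V", "NM", "N"]),
    ("max_buffer_size", ["NM", "NM", "N"]),
    ("isEmpty", ["V", "NM"]),
    ("convertToString", ["V", "P", "N"]),
    ("userCount", ["NM", "N"]),
    ("fileList", ["NM", "N"]),
    ("items", ["NPL"]),
]


def train(oracle):
    """Fit a vectorizer and a gradient boosting classifier on the oracle."""
    feature_dicts, tags = [], []
    for name, pattern in oracle:
        words = split_identifier(name)
        if len(words) != len(pattern):
            raise ValueError(f"tag count does not match word count: {name}")
        for i, tag in enumerate(pattern):
            feature_dicts.append(word_features(words, i))
            tags.append(tag)
    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(feature_dicts)
    classifier = GradientBoostingClassifier(n_estimators=50, random_state=0)
    classifier.fit(X, tags)
    return vectorizer, classifier


def grammar_pattern(name, vectorizer, classifier):
    """Predict one tag per word and join the tags into a grammar pattern."""
    words = split_identifier(name)
    X = vectorizer.transform([word_features(words, i) for i in range(len(words))])
    return " ".join(str(tag) for tag in classifier.predict(X))


if __name__ == "__main__":
    vectorizer, classifier = train(TOY_ORACLE)
    print(grammar_pattern("getFileName", vectorizer, classifier))
```

Each word is classified independently here, which keeps the sketch short. A tagger intended for real use would likely also need context features, such as neighboring words or the kind of identifier being tagged.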