Constructing supervised learning models requires labelled examples to build a corpus and train a machine learning model. However, most studies have built their labelled datasets manually, which is often a daunting task. To mitigate this problem, we have built an online tool called CodeLabeller. CodeLabeller is a web-based tool that aims to provide an efficient approach to labelling source code files for supervised learning methods at scale by improving the entire data collection process. We tested CodeLabeller by constructing a corpus of over a thousand source files obtained from a large collection of open source Java projects and labelling each Java source file with its respective design patterns and summaries. Twenty-five experts in the field of software engineering participated in a usability evaluation of the tool using the standard User Experience Questionnaire (UEQ), administered as an online survey. The survey results demonstrate that the tool achieves a "Good" rating on both hedonic and pragmatic quality, is easy to use, and meets the needs of annotating a corpus for supervised classifiers. Apart from assisting researchers in crowdsourcing a labelled dataset, the tool has practical applicability in software engineering education and assists in building expert ratings for software artefacts.