Context: Developers spend most of their time comprehending source code during software development. Automatically assessing how readable and understandable source code is can benefit tasks such as task triaging and code review. While several studies have proposed approaches to predict software readability and understandability, most focus only on local characteristics of source code. Moreover, the performance of understandability prediction is far from satisfactory. Objective: In this study, we aim to assess readability and understandability from the perspective of language acquisition. More specifically, we investigate whether code readability and understandability are correlated with the naturalness and vocabulary difficulty of source code. Method: To assess code naturalness, we adopt the cross-entropy metric; to assess vocabulary difficulty, we use a manually crafted list of code elements with their assigned advancement levels. We will conduct a statistical analysis to understand these correlations and analyze whether code naturalness and vocabulary difficulty can be used to improve the performance of code readability and understandability prediction methods. The study will be conducted on existing datasets.
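As a rough illustration of the cross-entropy metric named in the Method, the sketch below scores a token sequence under a bigram language model. The abstract does not say which language model or tokenizer is used, so the bigram model, the add-one smoothing, the `<s>` start marker, and the function names `train_bigram` and `cross_entropy` are all assumptions made for this example only.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-sequence marker


def train_bigram(corpus):
    """Count bigram contexts and bigrams over pre-tokenized snippets."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        padded = [START] + list(tokens)
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1           # how often prev appears as a context
            bigrams[(prev, cur)] += 1     # how often cur follows prev
    return contexts, bigrams, vocab


def cross_entropy(tokens, model):
    """Average negative log2 probability per token (lower = more 'natural')."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("cannot score an empty token sequence")
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot reserved for unseen tokens
    padded = [START] + tokens
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams at non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
    return total / len(tokens)


# Toy corpus of already-tokenized statements (illustrative data only).
corpus = [["x", "=", "1"], ["y", "=", "2"]]
model = train_bigram(corpus)
familiar = cross_entropy(["x", "=", "1"], model)
unfamiliar = cross_entropy(["1", "=", "x"], model)
```

Under these assumptions, a snippet whose token order resembles the training corpus receives a lower cross-entropy than one with an unusual order, which is the sense in which the metric serves as a proxy for naturalness.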