Although information theory has found success in many disciplines, the literature on its applications to software evolution is limited. We are still missing artifacts that leverage the available data and tooling to measure how the information content of a project can serve as a proxy for its complexity. In this work, we explore two definitions of entropy, one structural and one textual, and apply them to the commit histories of 25 open source projects. We produce evidence that the two are generally highly correlated. We also observe that they display weak and unstable correlations with other complexity metrics. Our preliminary investigation of outliers shows an unexpectedly high frequency of events in which the information content of the project changes considerably, suggesting that such outliers may inform a definition of surprisal.