Although information theory has found success in other disciplines, the literature on its applications to software evolution is limited. We are still missing artifacts that leverage the available data and tooling to measure how the information content of a project can serve as a proxy for its complexity. In this work, we explore two definitions of entropy, one structural and one textual, and apply them to the commit histories of 25 open source projects. We produce evidence that the two are generally highly correlated. We also observe that they display weak and unstable correlations with other complexity metrics. Our preliminary investigation of outliers shows an unexpectedly high frequency of events in which the information content of the project changes considerably, suggesting that such outliers may inform a definition of surprisal.
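As an illustration of what a textual entropy measurement over a commit history could look like, the following minimal sketch computes the Shannon entropy of the empirical token distribution of each snapshot and the change between consecutive snapshots. The abstract does not specify the definitions used in this work, so the whitespace tokenization, the helper names `shannon_entropy` and `entropy_deltas`, and the representation of a snapshot as a plain string are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical distribution of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention: an empty snapshot carries no information
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_deltas(snapshots):
    """Entropy change between consecutive snapshots.

    `snapshots` is a list of strings, one per commit (hypothetical input
    format); tokens are obtained by naive whitespace splitting.
    """
    entropies = [shannon_entropy(s.split()) for s in snapshots]
    return [after - before for before, after in zip(entropies, entropies[1:])]
```

Under these assumptions, unusually large values in the list returned by `entropy_deltas` would be the candidates for the outlier events the abstract describes.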