On Interpreting the Effectiveness of Unsupervised Software Traceability with Information Theory

Traceability is a cornerstone of modern software development, ensuring system reliability and facilitating software maintenance. While unsupervised techniques leveraging Information Retrieval (IR) and Machine Learning (ML) methods have been widely used for predicting trace links, their effectiveness remains underexplored. In particular, these techniques often assume that traceability patterns are present within textual data, a premise that may not hold universally. Moreover, standard evaluation metrics such as precision, recall, accuracy, or F1 measure can misrepresent model performance when the underlying data distributions are not properly analyzed. Given that automated traceability techniques tend to struggle to establish links, we need further insight into the information limits of traceability artifacts. In this paper, we propose an approach, TraceXplainer, that uses information theory metrics to evaluate and better understand the performance (limits) of unsupervised traceability techniques. Specifically, we introduce self-information, cross-entropy, and mutual information (MI) as metrics to measure the informativeness and reliability of traceability links. Through a comprehensive replication and analysis of well-studied datasets and techniques, we investigate the effectiveness of unsupervised techniques that predict traceability links using IR/ML. This application of TraceXplainer reveals an imbalance in typical traceability datasets, where the source code has on average 1.48 more bits of information (i.e., entropy) than the linked documentation. Additionally, we demonstrate that an average MI of 4.81 bits, a loss of 1.75 bits, and a noise of 0.28 bits signify that there are information-theoretic limits on the effectiveness of unsupervised traceability techniques. We hope these findings spur additional research on understanding the limits and progress of traceability research.
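To make the quantities named above concrete, the following is a minimal sketch of how entropy, mutual information, loss, and noise can be computed for a pair of artifacts. It is not the TraceXplainer implementation. It assumes a joint probability table over token-level observations (X from one artifact, Y from the other) is already available, and it assumes the standard channel reading in which loss is the conditional entropy H(X|Y) and noise is H(Y|X). The function names `entropy`, `token_entropy`, and `info_measures` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(probs):
    """Shannon entropy, in bits, of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def token_entropy(tokens):
    """Entropy, in bits, of the empirical token distribution of one artifact."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())


def info_measures(joint):
    """Compute information measures from a joint distribution.

    `joint` maps (x, y) pairs to probabilities that sum to 1.
    Assumption: loss = H(X|Y) and noise = H(Y|X).
    """
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p  # marginal distribution of X
        p_y[y] += p  # marginal distribution of Y
    h_x = entropy(p_x.values())
    h_y = entropy(p_y.values())
    h_xy = entropy(joint.values())
    mi = h_x + h_y - h_xy  # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return {
        "H_X": h_x,
        "H_Y": h_y,
        "MI": mi,
        "loss": h_x - mi,   # H(X|Y): information in X not recovered from Y
        "noise": h_y - mi,  # H(Y|X): information in Y not explained by X
    }


# Toy joint distribution: X is uniform over four symbols and Y is the parity of X.
toy_joint = {(x, x % 2): 0.25 for x in range(4)}
measures = info_measures(toy_joint)
```

In this toy case X carries 2 bits and Y carries 1 bit, so 1 bit is shared (MI), 1 bit of X is lost, and no noise is introduced. Under the assumed definitions, the gap between the two entropies equals loss minus noise, which is the kind of imbalance the abstract reports between source code and documentation.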
