Most machine learning and data analytics applications, including performance engineering for software systems, require large amounts of labelled data, which may not be available in advance. Acquiring these labels often demands significant time, effort, and computational resources. To address this challenge, we develop a unified active learning framework specialized for software performance prediction. We first parse the source code into an Abstract Syntax Tree (AST) and augment it with data-flow and control-flow edges, which turns the tree representation into a Flow-Augmented AST (FA-AST) graph. From this graph representation, we construct several graph embeddings, both unsupervised and supervised, that map each program into a latent space. Given such an embedding, the framework becomes task agnostic: active learning can be performed with any regression method and any query strategy suited to regression. Within this framework, we investigate how different levels of available information, e.g., partially available labels and unlabelled test data, affect active and passive learning. Our approach aims to make investment in AI models for software performance prediction (here, execution time) more efficient by exploiting the structure of the source code. Our real-world experiments show that respectable predictive performance can be achieved by querying labels for only a small subset of the data.
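As a rough illustration of the first step described above (parsing code into an AST and adding flow edges), the following sketch builds a simplified flow-augmented graph. It rests on several assumptions that are not taken from the paper: Python's standard `ast` module stands in for whatever parser the authors use, the function name `build_flow_augmented_graph` and the edge-type labels (`child`, `parent`, `next_sibling`, `next_stmt`, `while_exec`, `while_next`) are hypothetical, and data-flow edges are omitted for brevity.

```python
import ast
from collections import Counter


def build_flow_augmented_graph(source):
    """Return (nodes, edges) for a simplified flow-augmented AST.

    nodes: list of AST node type names, indexed by integer node id.
    edges: list of (src_id, dst_id, edge_type) triples.
    Edge-type names are illustrative assumptions, not the paper's schema.
    """
    nodes, edges = [], []
    node_ids = {}  # id(ast_node) -> integer node id, used for flow-edge lookups

    def visit(node):
        nid = len(nodes)
        nodes.append(type(node).__name__)
        node_ids[id(node)] = nid
        child_ids = [visit(child) for child in ast.iter_child_nodes(node)]

        # Structural edges: child/parent in both directions, plus sibling order.
        for cid in child_ids:
            edges.append((nid, cid, "child"))
            edges.append((cid, nid, "parent"))
        for left, right in zip(child_ids, child_ids[1:]):
            edges.append((left, right, "next_sibling"))

        # Sequential control flow between consecutive statements in a block.
        for field in ("body", "orelse"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                for first, second in zip(block, block[1:]):
                    edges.append((node_ids[id(first)], node_ids[id(second)], "next_stmt"))

        # Loop control flow: condition -> body entry, body exit -> condition.
        if isinstance(node, ast.While) and node.body:
            cond = node_ids[id(node.test)]
            edges.append((cond, node_ids[id(node.body[0])], "while_exec"))
            edges.append((node_ids[id(node.body[-1])], cond, "while_next"))
        return nid

    visit(ast.parse(source))
    return nodes, edges


SAMPLE = """
def count(n):
    i = 0
    while i < n:
        i += 1
    return i
"""

nodes, edges = build_flow_augmented_graph(SAMPLE)
edge_counts = Counter(edge_type for _, _, edge_type in edges)
print(len(nodes), "nodes;", dict(edge_counts))
```

The resulting node list and typed edge list could then be handed to a graph embedding method of one's choice; the latent vectors, rather than the graph itself, would be what a regression model and query strategy operate on.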