Artificial intelligence has gained considerable traction in recent years, with machine learning in particular finding applications across a wide range of fields. One application of interest to us is software safety and security, especially in the context of parallel programs. Automatically detecting concurrency bugs has long intrigued programmers, since the added layer of complexity makes concurrent programs more prone to failure. Automatic detection tools offer programmers considerable benefits: they save debugging time and reduce the number of unexpected bugs. We believe machine learning may help achieve this goal by offering advantages over current approaches in both overall tool accuracy and programming language flexibility. However, the machine learning approach brings numerous challenges of its own (correctly labelling a sufficiently large dataset, finding the best model types and architectures, and so forth), so we have to address each issue in developing such a tool separately. The focus of this project is therefore on comparing both common and recent machine learning approaches. We abstract away the complexity of procuring a labelled dataset of concurrent programs by defining and generating a synthetic dataset intended to simulate real-life (concurrent) programs. We formulate hypotheses about fundamental limits of various machine learning model types, which we then validate by running extensive tests on our synthetic dataset. We hope that our findings provide more insight into the advantages and disadvantages of various model types when modelling programs using machine learning, as well as in related fields (e.g. NLP).