Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates from measurements of the behavior of a program on a dataset of input examples. A key challenge in surrogate construction is deciding which training data to use for a given program. We present a methodology for sampling datasets to train neural-network-based surrogates of programs. We first characterize the proportion of data to sample from each region of a program's input space (each region corresponding to a different execution path of the program) based on the complexity of learning a surrogate of the corresponding execution path. We next provide a program analysis to determine the complexity of different paths in a program. We evaluate this methodology on a range of real-world programs and show that complexity-guided sampling yields empirical improvements in surrogate accuracy.
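To make the idea of complexity-guided sampling concrete, the following is a minimal sketch and not the methodology's actual allocation rule. It assumes that each execution path already has a numeric complexity score (in the methodology this would come from the program analysis) and that the sampling budget is split in direct proportion to those scores. The names `allocate_samples` and `sample_dataset`, the path labels, and the scores are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple


def allocate_samples(complexity: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` samples across paths in proportion to complexity.

    Assumption: plain proportional allocation, chosen only for illustration.
    """
    total = sum(complexity.values())
    if total <= 0:
        raise ValueError("complexity scores must sum to a positive value")
    # Ideal fractional share of the budget for each path.
    exact = {path: budget * c / total for path, c in complexity.items()}
    counts = {path: int(share) for path, share in exact.items()}
    # Hand out samples lost to rounding, largest fractional remainder first.
    leftover = budget - sum(counts.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - counts[p], reverse=True)
    for path in by_remainder[:leftover]:
        counts[path] += 1
    return counts


def sample_dataset(
    samplers: Dict[str, Callable[[random.Random], float]],
    counts: Dict[str, int],
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Draw the allocated number of inputs from each path's input region."""
    rng = random.Random(seed)
    return [(path, samplers[path](rng)) for path, n in counts.items() for _ in range(n)]


# Hypothetical program with two paths: x < 0 (simple) and x >= 0 (harder).
complexity = {"neg": 1.0, "nonneg": 3.0}
samplers = {
    "neg": lambda rng: rng.uniform(-1.0, 0.0),
    "nonneg": lambda rng: rng.uniform(0.0, 1.0),
}
counts = allocate_samples(complexity, budget=100)
dataset = sample_dataset(samplers, counts)
```

With these illustrative scores, the path labeled `nonneg` receives three times as many training inputs as `neg`. A uniform sampler over the whole input space would instead allocate data by region size, regardless of how hard each path is to learn.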