Resolving and rewriting references is a fundamental problem in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into two problems: direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures struggle on them. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples ten times longer than those the best baseline can handle. We also measure the impact of our architecture on the real-world task of decompiling switch statements, which includes an indexing subtask; there, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.