This paper discusses and evaluates data balancing and data augmentation in the context of mathematical objects, an important topic for both the symbolic computation and satisfiability checking communities when they use machine learning (ML) techniques to optimise their tools. We consider a dataset of non-linear polynomial problems and the task of selecting a variable ordering for cylindrical algebraic decomposition with which to tackle them. By swapping the variable names in already labelled problems, we generate new problem instances that require no further labelling when the selection is viewed as a classification problem. We find that this augmentation increases the accuracy of ML models by 63% on average. We study how much of this improvement is due to balancing the dataset and how much to the further increase in its size, concluding that both have a very significant effect. We finish by reflecting on how this idea could be applied in other uses of machine learning in mathematics.
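To make the augmentation idea concrete, the following is a minimal sketch of generating new labelled instances by renaming variables. It is not the paper's implementation. The data representation is an assumption made for illustration: each polynomial is a dictionary from exponent tuples to coefficients, and the label is a tuple of variable indices giving the chosen ordering. The helper names `rename_variables` and `augment` are hypothetical, as is the three-variable example. The sketch relies on the premise stated above that a renamed problem needs no new labelling, because its label is the original label with the same renaming applied.

```python
from itertools import permutations


def rename_variables(polys, label, sigma):
    """Rename variable i to sigma[i] in every polynomial and in the label.

    polys: list of dicts mapping exponent tuples to coefficients (assumed format).
    label: tuple of variable indices, the ordering chosen for the original problem.
    sigma: tuple where sigma[i] is the new index of old variable i.
    """
    new_polys = []
    for poly in polys:
        new_poly = {}
        for exps, coeff in poly.items():
            new_exps = [0] * len(exps)
            for i, e in enumerate(exps):
                # The exponent of old variable i moves to position sigma[i].
                new_exps[sigma[i]] = e
            new_poly[tuple(new_exps)] = coeff
        new_polys.append(new_poly)
    # The label is carried along by the same renaming, so no relabelling is needed.
    new_label = tuple(sigma[v] for v in label)
    return new_polys, new_label


def augment(polys, label):
    """Return one (problem, label) pair per permutation of the variables."""
    n = len(label)
    return [rename_variables(polys, label, sigma)
            for sigma in permutations(range(n))]


# Hypothetical example: the single polynomial x0^2 * x2 - 3 * x1,
# labelled with the ordering (x2, x0, x1).
example_polys = [{(2, 0, 1): 1, (0, 1, 0): -3}]
example_label = (2, 0, 1)
augmented = augment(example_polys, example_label)
labels = [lab for _, lab in augmented]
```

In this sketch every problem yields one instance per possible ordering, so each class receives the same number of augmented instances regardless of how skewed the original labels were. This is one way to see why the augmentation both enlarges and balances the dataset.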