Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks ranging from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; and (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing benchmark datasets, collecting task instructions and solutions in the form of Python programs and thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
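To make the idea of a program-form solution concrete, the following is a minimal sketch rather than the benchmark's actual format or evaluation code. The instance fields, the example problem, and the helper names (`run_program`, `token_f1`, `relative_improvement`) are all hypothetical, and the token-overlap F1 shown here is an assumed definition of the metric, since the abstract does not specify one.

```python
import contextlib
import io
from collections import Counter

# Hypothetical instance: field names and the problem itself are illustrative,
# not taken from LILA.
instance = {
    "instruction": "Write a Python program that solves the word problem.",
    "question": ("A shopper buys 3 apples at 2 dollars each and pays with "
                 "a 10 dollar bill. How much change is returned?"),
    "program": "cost = 3 * 2\nchange = 10 - cost\nprint(change)",
    "answer": "4",
}


def run_program(source):
    """Execute a solution program and return whatever it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # only run trusted programs this way
    return buffer.getvalue().strip()


def token_f1(prediction, reference):
    """Token-overlap F1 between two answer strings (assumed metric)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent."""
    return 100.0 * (new_score - old_score) / old_score


predicted = run_program(instance["program"])
print(predicted, token_f1(predicted, instance["answer"]))
```

The program serves as the explanation: each intermediate variable records a reasoning step, and executing it yields the answer that is then scored against the reference. A relative improvement, as reported for multi-task versus single-task models, is the score difference divided by the baseline score.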