Despite a growing body of work at the intersection of deep learning and formal languages, there has been relatively little systematic exploration of transformer models for reasoning about typed lambda calculi. This is an interesting area of inquiry for two reasons. First, typed lambda calculi are the lingua franc of programming languages. A set of heuristics that relate various typed lambda calculi to effective neural architectures would provide a systematic method for mapping language features (e.g., polymorphism, subtyping, inheritance, etc.) to architecture choices. Second, transformer models are widely used in deep learning architectures applied to code, but the design and hyperparameter space for them is large and relatively unexplored in programming language applications. Therefore, we suggest a benchmark that allows us to explore exactly this through perhaps the simplest and most fundamental property of a programming language: the relationship between terms and types. Consequently, we begin this inquiry of transformer architectures for typed lambda calculi by exploring the effect of transformer warm-up and optimizer selection in the task of type inference: i.e., predicting the types of lambda calculus terms using only transformers. We find that the optimization landscape is difficult even in this simple setting. One particular experimental finding is that optimization by Adafactor converges much faster compared to the optimization by Adam and RAdam. We conjecture that such different performance of optimizers might be related to the difficulties of generalization over formally generated dataset.
翻译:尽管深度学习和形式语言的交叉研究日益增多,但针对类型化λ演算推理的Transformer模型系统探索仍相对有限。这一研究领域具有双重意义:首先,类型化λ演算是编程语言的通用语言。建立将不同类型化λ演算与有效神经架构相关联的启发式方法,可为语言特性(如多态性、子类型化、继承等)到架构选择的映射提供系统方法论。其次,Transformer模型广泛应用于代码相关的深度学习架构中,但其设计与超参数空间在编程语言应用中仍存在大量未探索领域。因此,我们提出一个基准测试,通过编程语言最基本属性——项与类型的关系来精确探索该问题。我们以类型推断任务(即仅用Transformer预测λ演算项的类别)为切入点,考察Transformer预热策略与优化器选择的影响。实验表明,即使在如此简单的设定下,优化过程仍充满挑战。其中一项关键发现是:Adafactor优化器的收敛速度显著快于Adam和RAdam。我们推测,不同优化器性能差异可能与形式化生成数据集上的泛化困难有关。