This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment on translating formal syntax -- such as first-order logic and regular expressions -- to natural language (interpretation) and vice versa (compilation), thereby facilitating applications such as generating and explaining logic and control flow for programs. Existing approaches to LLM assessment in these areas require labor-intensive ground-truth creation, and reliance on such hand-crafted data undermines the separation of training and test sets. Furthermore, these datasets typically contain relatively few hand-coded test cases over which LLM accuracy is measured, making them inadequate for determining the safety or correctness of generated outputs. We introduce a new approach that uses context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and performs closed-loop testing of LLM capabilities using formal verifiers, guaranteeing the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also assess several state-of-the-art (SOTA) closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.
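To make the paradigm concrete, the following is a minimal, self-contained sketch (not the paper's actual pipeline): a tiny CFG samples propositional-logic formulas on the fly, and a brute-force truth-table check plays the role of the formal verifier that closes the loop by testing whether a candidate translation is logically equivalent to the source formula. All names here (\texttt{GRAMMAR}, \texttt{sample}, \texttt{equivalent}) are illustrative assumptions, and a real deployment would use a full grammar and a proper solver.

```python
import itertools
import random

# Illustrative CFG over propositional logic with atoms p, q, r.
GRAMMAR = {
    "F": [["A"], ["(", "F", "and", "F", ")"], ["(", "F", "or", "F", ")"], ["not", "F"]],
    "A": [["p"], ["q"], ["r"]],
}

def sample(symbol="F", depth=0, rng=random):
    """Sample a token list from the grammar, biasing toward atoms at depth 3+."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token
    rules = GRAMMAR[symbol]
    rule = rng.choice(rules[:1] if depth >= 3 and symbol == "F" else rules)
    out = []
    for sym in rule:
        out.extend(sample(sym, depth + 1, rng))
    return out

def evaluate(tokens, env):
    """Evaluate a formula under a truth assignment via Python's boolean operators."""
    expr = " ".join(t if t in ("and", "or", "not", "(", ")") else str(env[t]) for t in tokens)
    return eval(expr)

def equivalent(f1, f2, atoms=("p", "q", "r")):
    """Brute-force 'verifier': equivalence over all 2^n truth assignments."""
    for values in itertools.product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if evaluate(f1, env) != evaluate(f2, env):
            return False
    return True

# Closed-loop check: a faithful round-trip translation of a freshly
# sampled formula must be equivalent to the original.
formula = sample()
assert equivalent(formula, formula)
```

In the actual framework, the `equivalent` step would be replaced by a formal verifier appropriate to the syntax under test (e.g., a theorem prover for first-order logic or automaton equivalence for regular expressions), so correctness is certified without any human-labeled ground truth.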