Large language models (LLMs) have shown impressive results on a wide range of tasks. However, they have limitations, such as hallucinating facts and struggling with arithmetic. Recent work has addressed these issues with sophisticated decoding techniques. Performant decoding, particularly for such techniques, relies crucially on parallelization and batching, both of which are difficult for developers to implement by hand. We make two observations: 1) existing approaches are high-level domain-specific languages for gluing together expensive black-box calls, but they are neither general nor compositional; 2) LLM programs are essentially pure (all effects commute). Guided by these observations, we develop a novel, general-purpose lambda calculus for automatically parallelizing a wide range of LLM interactions without user intervention. The key difference from the standard lambda calculus is a novel "opportunistic" evaluation strategy, which steps independent parts of a program in parallel and dispatches black-box external calls as eagerly as possible, even while data-independent parts of the program are waiting for their own external calls to return. To maintain the simplicity of the language and to ensure uniformity of opportunistic evaluation, control flow and looping constructs are implemented in-language, via Church encodings. We implement this approach in a framework called EPIC, which is embedded in, and interoperates closely with, Python. We demonstrate its versatility and performance with three case studies drawn from the machine learning literature: Tree-of-Thoughts (LLMs embedded in classic search procedures), nested tool use, and constrained decoding. Our experiments show that opportunistic evaluation offers a $1.5\times$ to $4.8\times$ speedup over sequential evaluation, while still allowing practitioners to write straightforward and composable programs, without any manual parallelism or batching.
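To make the idea of dispatching independent external calls eagerly more concrete, the following is a minimal hand-written sketch in plain Python with `asyncio`. It does not use EPIC or its lambda calculus, whose API is not described here; `FakeLLM`, `sequential`, `opportunistic`, and `run` are hypothetical names introduced only for illustration, and the "LLM" is simulated with a short sleep. The sketch parallelizes by hand the kind of data-independent calls that, according to the abstract, opportunistic evaluation is meant to find automatically.

```python
import asyncio


class FakeLLM:
    """Hypothetical stand-in for an expensive black-box call (not EPIC's API)."""

    def __init__(self, delay=0.01):
        self.delay = delay
        self.in_flight = 0      # calls currently waiting on the "model"
        self.max_in_flight = 0  # peak number of simultaneous calls

    async def call(self, prompt):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(self.delay)  # simulated network/model latency
        self.in_flight -= 1
        return f"<{prompt}>"


async def sequential(llm, question):
    # Each call waits for the previous one, even though the first two
    # do not depend on each other.
    a = await llm.call(f"plan A for {question}")
    b = await llm.call(f"plan B for {question}")
    return await llm.call(f"pick best of {a} and {b}")


async def opportunistic(llm, question):
    # Dispatch both independent calls before waiting on either result.
    a_task = asyncio.create_task(llm.call(f"plan A for {question}"))
    b_task = asyncio.create_task(llm.call(f"plan B for {question}"))
    a, b = await asyncio.gather(a_task, b_task)
    # The final call depends on both results, so it must wait for them.
    return await llm.call(f"pick best of {a} and {b}")


def run(strategy):
    """Run one strategy on a fresh FakeLLM; return (result, peak concurrency)."""
    llm = FakeLLM()
    result = asyncio.run(strategy(llm, "q"))
    return result, llm.max_in_flight


if __name__ == "__main__":
    print(run(sequential))
    print(run(opportunistic))
```

Both strategies return the same result, which is consistent with the observation that the calls are effectively pure, while the second keeps two calls in flight at once. In this sketch the overlap is written manually; the point of the approach described above is that the programmer would write something closer to `sequential` and leave the overlap to the evaluator.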