We view large language models (LLMs) as stochastic \emph{language layers} in a network, where the learnable parameters are the natural language \emph{prompts} at each layer. We stack two such layers, feeding the output of one layer to the next, and call the stacked architecture a \emph{Deep Language Network} (DLN). We first show how to perform prompt optimization effectively for a 1-layer language network (DLN-1). We then show how to train 2-layer DLNs (DLN-2), where two prompts must be learnt. We treat the output of the first layer as a latent variable to be marginalized, and devise a variational inference algorithm for joint prompt training. A DLN-2 reaches higher performance than a single layer, sometimes comparable to few-shot GPT-4, even when each LLM in the network is smaller and less powerful. The DLN code is open source (https://github.com/microsoft/deep-language-networks).
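
As a sketch of the latent-variable view above, in notation that is assumed here rather than taken from the abstract, write $x$ for the input, $h$ for the first-layer output, $y$ for the target, $\pi_0$ and $\pi_1$ for the prompts of the first and second layer, and $q(h)$ for an approximate posterior over $h$. Assuming the second layer sees only $h$, marginalizing the first-layer output gives
\[
p(y \mid x; \pi_0, \pi_1) = \sum_{h} p_{\mathrm{LM}}(y \mid h; \pi_1)\, p_{\mathrm{LM}}(h \mid x; \pi_0),
\]
and the standard evidence lower bound, which follows from Jensen's inequality, reads
\[
\log p(y \mid x; \pi_0, \pi_1) \ge \mathrm{E}_{q(h)}\!\left[ \log p_{\mathrm{LM}}(y \mid h; \pi_1) + \log p_{\mathrm{LM}}(h \mid x; \pi_0) - \log q(h) \right].
\]
Under this reading, joint prompt training amounts to raising the bound with respect to $\pi_0$, $\pi_1$, and $q$; the abstract does not specify how the proposed algorithm carries out that optimization.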