Large language models (LLMs) have shown impressive capabilities across diverse settings, but still struggle as the length and complexity of the context increase. To address this challenge, we propose Thinking Recursively and Dynamically (ThReaD). THREAD frames model generation as a thread of execution that, based on the context, can run to completion or dynamically spawn new threads. By spawning, threads can offload work (e.g., thinking, retrieving information) to child threads, which return only the tokens needed for the parent thread to do its work. In effect, this enables the model to adapt, as needed, the amount of intermediate work used to produce tokens. We apply THREAD in the settings of LLM task solving and question answering, where the dynamic threading allows the model to recursively decompose the given task or question into progressively simpler sub-problems that can be solved by separate child threads. We test THREAD, implemented using a few-shot learning approach, on diverse benchmarks for agent tasks and data-grounded question answering. THREAD achieves state-of-the-art performance with GPT-4 and GPT-3.5 on these benchmarks, including ALFWorld, TextCraft, and WebShop, along with two new benchmarks, DataCommons QA and MIMIC-III ICU QA. In addition, THREAD outperforms existing frameworks by 10 to 50 absolute percentage points with smaller models, including Llama-3-8b and CodeLlama-7b.
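The core mechanism can be sketched as a recursive procedure: a thread either runs to completion or spawns child threads for sub-problems, and only each child's final result tokens flow back to the parent. This is a minimal, hypothetical illustration with a mocked model step; the names `run_thread` and `llm_step` and the decomposition heuristic are assumptions for exposition, not the paper's actual implementation.

```python
def llm_step(context):
    """Mocked model step: decide whether to answer directly or spawn.

    A real system would prompt an LLM here (few-shot, as in the paper);
    this toy heuristic decomposes any task containing " and ".
    """
    task = context[-1]
    if " and " in task:
        return ("spawn", task.split(" and "))
    return ("answer", f"done: {task}")

def run_thread(context):
    """One thread of execution: run to completion or spawn child threads."""
    action, payload = llm_step(context)
    if action == "answer":
        return payload
    # Offload each sub-problem to a child thread; only the child's
    # returned result tokens are kept in the parent's context.
    results = [run_thread(context[:-1] + [sub]) for sub in payload]
    return "; ".join(results)

print(run_thread(["goal:", "boil water and brew tea"]))
```

The parent thread never sees the children's intermediate reasoning, only their returned results, which is what lets the amount of intermediate work grow or shrink with task complexity.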