Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches, routing and cascading, treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
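The shepherding loop described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: all function names (`predict_hint_budget`, `llm_prefix`, `slm_complete`, `shepherd`) are hypothetical stand-ins, and the predictor and both models are stubbed with placeholders.

```python
def predict_hint_budget(query: str) -> int:
    """Two-stage predictor (stubbed): return 0 if no hint is needed,
    otherwise the number of LLM tokens to request as a hint."""
    # Stand-in heuristic: short queries get no hint; longer ones get
    # a budget of roughly 20% of the query length (illustrative only).
    words = len(query.split())
    return 0 if words < 5 else max(1, words // 5)

def llm_prefix(query: str, budget: int) -> str:
    """Request only the first `budget` tokens of the LLM response (stub)."""
    full_response = "Step 1: identify the quantities. Step 2: combine them."
    return " ".join(full_response.split()[:budget])

def slm_complete(query: str, hint: str) -> str:
    """SLM continues generation from the hint prefix (stub)."""
    return (hint + " [SLM continuation]").strip()

def shepherd(query: str) -> str:
    budget = predict_hint_budget(query)
    # budget == 0 degenerates to routing (SLM answers alone); a budget
    # covering the full response degenerates to LLM-only inference.
    hint = llm_prefix(query, budget) if budget > 0 else ""
    return slm_complete(query, hint)
```

A budget of zero recovers routing and a full-length budget recovers LLM-only generation, which is the sense in which shepherding generalizes both baselines.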