Inference optimizations are critical for improving user experience and for reducing infrastructure costs and power consumption. In this article, we illustrate speculative sampling, a form of dynamic execution that can reduce the overall latency of text generation, and compare it with standard autoregressive sampling. Speculative sampling can be combined with model-based optimizations (e.g., quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
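To make the comparison concrete, the following is a minimal sketch of one speculative sampling step in its commonly described form: a small draft model proposes `k` tokens, the target model scores them, and each proposal is accepted with probability min(1, p/q). It is not the code from the accompanying notebook. The names `speculative_step`, `sample`, `target`, and `draft` are hypothetical, the models are assumed to be plain functions that map a token list to a probability list, and KV caching is omitted for brevity.

```python
import random


def sample(probs, rng):
    """Draw one token index from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(target, draft, prefix, k, rng):
    """Return between 1 and k + 1 new tokens for the given prefix.

    `target` and `draft` are hypothetical callables: token list -> probability list.
    """
    # 1. The draft model proposes k tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every position. A real system would do this
    #    in a single batched forward pass; here it is a simple loop.
    p_dists = [target(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept each drafted token with probability min(1, p / q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q) and stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out

    # 4. All drafts accepted: take one extra token from the target model.
    out.append(sample(p_dists[k], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    toy_target = lambda ctx: [0.1, 0.7, 0.2]  # toy distributions, not real models
    toy_draft = lambda ctx: [0.3, 0.5, 0.2]
    print(speculative_step(toy_target, toy_draft, [0], k=4, rng=rng))
```

Standard autoregressive sampling corresponds to calling `sample(target(ctx), rng)` once per token. The sketch shows why speculation can help: when the draft distribution is close to the target distribution, several tokens are accepted per target evaluation.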