LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework that applies entropy quenching, a thermodynamics-inspired process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below a threshold, and a fidelity gate halts compression when the energy-weighted similarity $S_E$ drops below $\theta$. We prove that token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that the achievable compression ratio approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40-60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10-25 percentage points. Context deduplication adds 50-70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, the reductions compose multiplicatively to 88-96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
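To make the schedule and the multiplicative composition concrete, the following is a minimal Python sketch rather than the framework's implementation. The function names `quench_temperature` and `combined_reduction`, the default values of `t0` and `alpha`, and the stage reductions in the usage lines are all hypothetical. It also assumes that "composes multiplicatively" means the surviving token fractions of successive stages multiply.

```python
from math import prod


def quench_temperature(tau: float, t0: float = 1.0, alpha: float = 0.5) -> float:
    """Quenching schedule T(tau) = T_0 / (1 + alpha * tau).

    t0 and alpha are illustrative defaults, not values from the paper.
    """
    return t0 / (1.0 + alpha * tau)


def combined_reduction(stage_reductions: list[float]) -> float:
    """Overall fraction of tokens removed by a chain of stages.

    Assumption: each stage removes its stated fraction of whatever the
    previous stage left, so the surviving fractions (1 - r) multiply.
    """
    return 1.0 - prod(1.0 - r for r in stage_reductions)


if __name__ == "__main__":
    # Temperature falls monotonically as the quench step tau increases.
    for tau in range(4):
        print(tau, round(quench_temperature(tau), 3))

    # Two hypothetical stages removing 50% and 60% of their input.
    print(round(combined_reduction([0.5, 0.6]), 3))
```

Under this reading, two stages that each remove about half of their input already remove roughly four fifths of the original tokens, which is how several moderate per-stage savings can reach a high combined figure.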