We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
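The bitwise decomposition and the multi-step correction described above can be illustrated with a minimal sketch. The abstract does not specify Midicoth's calibration statistics, table layout, or number of denoising steps, so everything here is an assumption made for illustration: the names `byte_to_bit_path`, `BinaryCalibrator`, `train`, and `denoise` are hypothetical, the 16-bin table keyed only on the predicted probability is a stand-in for whatever contexts the real system uses, and two stages are chosen arbitrarily.

```python
def byte_to_bit_path(probs, byte):
    """Split one 256-way prediction into 8 binary decisions, MSB first.

    `probs` is a length-256 distribution. Nodes are numbered like a binary
    heap (root = 1), so each of the 255 internal nodes has a unique index.
    Returns a list of (node, p_one, bit) tuples along the path to `byte`.
    """
    lo, hi = 0, 256                      # symbols still possible: [lo, hi)
    node = 1
    path = []
    for depth in range(8):
        mid = (lo + hi) // 2
        total = sum(probs[lo:hi])
        # P(next bit = 1 | bits so far) = mass of upper half / mass of interval
        p_one = sum(probs[mid:hi]) / total if total > 0 else 0.5
        bit = (byte >> (7 - depth)) & 1
        path.append((node, p_one, bit))
        lo, hi = (mid, hi) if bit else (lo, mid)
        node = 2 * node + bit
    return path


class BinaryCalibrator:
    """Online table mapping a predicted P(bit=1) to the observed rate of ones.

    Assumption: a plain histogram over the predicted probability; the real
    system may condition on additional context.
    """

    def __init__(self, bins=16, prior_weight=2.0):
        self.bins = bins
        # Pseudo-counts centred on each bin keep an untrained table near-identity.
        self.ones = [prior_weight * (i + 0.5) / bins for i in range(bins)]
        self.total = [prior_weight] * bins

    def _bin(self, p):
        return min(self.bins - 1, int(p * self.bins))

    def correct(self, p):
        i = self._bin(p)
        return self.ones[i] / self.total[i]

    def update(self, p, bit):
        i = self._bin(p)
        self.ones[i] += bit
        self.total[i] += 1


def denoise(p, stages):
    """Apply each calibration stage in turn to the previous stage's output."""
    for stage in stages:
        p = stage.correct(p)
    return p


def train(p, bit, stages):
    """Update every stage with the observed bit, each on its own input."""
    for stage in stages:
        corrected = stage.correct(p)     # output computed before the update
        stage.update(p, bit)
        p = corrected


# Toy scenario: smoothing has shrunk a true P(bit=1) of 0.9 down to 0.7.
stages = [BinaryCalibrator(), BinaryCalibrator()]
for i in range(1000):
    train(0.7, 0 if i % 10 == 9 else 1, stages)
print(denoise(0.7, stages))
```

In this toy scenario the over-smoothed estimate of 0.7 is pulled back toward the empirical rate of 0.9, which is the shrinkage reversal the abstract refers to. Because every table entry sees a binary outcome rather than one of 256 symbols, each bin accumulates useful statistics after comparatively few observations.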