Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas, including neuroscience, genomics, image processing, and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms or on statistical estimators of the underlying process distribution. In this work, we propose a fully Bayesian approach to entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. These samples can be used to estimate the entire posterior distribution, which provides much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
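To make the idea of sampling from the induced posterior on the entropy rate concrete, the following is a minimal sketch under a deliberately simplified assumption: a first-order Markov chain with a fixed memory of one symbol and an independent Dirichlet prior on each row of the transition matrix. It is not the BCT sampler, which also averages over variable-memory context-tree models. All function names and the prior parameter `alpha` are hypothetical choices made for illustration.

```python
import numpy as np


def transition_counts(x, m):
    """Count transitions i -> j in a sequence x over the alphabet {0, ..., m-1}."""
    x = np.asarray(x, dtype=int)
    counts = np.zeros((m, m))
    np.add.at(counts, (x[:-1], x[1:]), 1)
    return counts


def stationary_distribution(P):
    """Stationary distribution of an irreducible transition matrix P."""
    m = P.shape[0]
    # Solve pi P = pi together with sum(pi) = 1 as a least-squares system.
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def entropy_rate(P):
    """Entropy rate, in bits per symbol, of a first-order Markov chain."""
    pi = stationary_distribution(P)
    # Clip before the log so that zero entries contribute 0 rather than NaN.
    log_P = np.log2(np.clip(P, 1e-300, None))
    return float(-np.sum(pi[:, None] * P * log_P))


def sample_entropy_posterior(x, m, n_samples=1000, alpha=0.5, seed=0):
    """Draw samples from the posterior of the entropy rate.

    Assumes (for illustration only) a first-order chain and a symmetric
    Dirichlet(alpha) prior on each row of the transition matrix.
    """
    rng = np.random.default_rng(seed)
    counts = transition_counts(x, m)
    samples = np.empty(n_samples)
    for k in range(n_samples):
        # Conjugacy: each row's posterior is Dirichlet(alpha + row counts).
        P = np.vstack([rng.dirichlet(alpha + counts[i]) for i in range(m)])
        samples[k] = entropy_rate(P)
    return samples


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.integers(0, 2, size=2000)  # toy i.i.d. binary sequence
    h = sample_entropy_posterior(x, m=2)
    print("posterior mean:", h.mean())
    print("95% credible interval:", np.quantile(h, [0.025, 0.975]))
```

Because every draw is a full transition matrix, any functional of the chain can be pushed through the same samples. The histogram of the returned values approximates the whole posterior of the entropy rate, from which credible intervals follow directly.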