Traditional data compression methods are typically based on symbol-level statistics: the information source is modeled as a long sequence of i.i.d. random variables or as a stochastic process, which establishes entropy as the fundamental limit for lossless compression and mutual information as the limit for lossy compression. However, real-world sources (including text, music, and speech) are often statistically ill-defined because of their close connection to human perception, so the model-driven approach can be quite suboptimal. This study focuses on English text and exploits its semantic aspect to further improve compression efficiency. The main idea stems from the crossword puzzle, where hidden words can still be reconstructed precisely as long as some key letters are provided. The proposed masking-based strategy resembles this game. In a nutshell, the encoder evaluates the semantic importance of each word according to its semantic loss and masks the less important ones, while the decoder recovers the masked words from the semantic context by means of a Transformer. Our experiments show that the proposed semantic approach achieves much higher compression efficiency than traditional methods such as Huffman coding and UTF-8 encoding, while preserving the meaning of the target text to a great extent.
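The encoder side of the masking idea can be illustrated with a minimal sketch. It is not the method evaluated in this study: the semantic loss is replaced here by a toy proxy (words that repeat more often within the text are treated as less important), and the names `MASK`, `importance_scores`, `mask_minor_words`, and `mask_ratio` are hypothetical, introduced only for illustration. The decoder, which would recover the masked words with a Transformer, is not shown.

```python
from collections import Counter

MASK = "_"  # hypothetical one-character placeholder for a masked word


def importance_scores(words):
    """Toy stand-in for semantic importance (assumption, not the paper's
    semantic loss): a word repeated more often in the text scores lower."""
    counts = Counter(w.lower() for w in words)
    return [1.0 / counts[w.lower()] for w in words]


def mask_minor_words(text, mask_ratio=0.3):
    """Replace the lowest-scoring fraction of words with MASK."""
    words = text.split()
    scores = importance_scores(words)
    n_mask = int(len(words) * mask_ratio)
    # Stable sort: among equal scores, earlier words are masked first.
    order = sorted(range(len(words)), key=lambda i: scores[i])
    to_mask = set(order[:n_mask])
    return " ".join(MASK if i in to_mask else w for i, w in enumerate(words))


if __name__ == "__main__":
    original = "the cat sat on the mat near the door"
    masked = mask_minor_words(original, mask_ratio=0.3)
    print(masked)
    print(len(original.encode("utf-8")), "->", len(masked.encode("utf-8")), "bytes")
```

In this sketch the saving comes only from replacing masked words with a shorter placeholder. How much meaning survives depends on how well the decoder can recover the masked words from context, which this toy example does not measure.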