This study investigates whether Compressed-Language Models (CLMs), i.e., language models that operate on the raw byte streams of files stored in Compressed File Formats~(CFFs), can understand the files those formats produce. We focus on JPEG as a representative CFF, given its ubiquity and its use of key compression concepts such as entropy coding and run-length encoding. We test whether CLMs understand the JPEG format by probing their capabilities along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Our findings demonstrate that CLMs can effectively perform these tasks. These results suggest that CLMs can understand the semantics of compressed data when operating directly on the byte streams of files produced by CFFs. The ability to operate directly on raw compressed files opens the door to leveraging some of their remarkable characteristics, such as their ubiquity, compactness, multi-modality, and segmented nature.