This paper investigates character encoders, which convert text characters into bytes. Local encoders such as ASCII and GB-2312 cover only a limited character repertoire but encode it in few bytes. Universal encoders such as UTF-8 and UTF-16 can encode the complete Unicode set, at the cost of more space, and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. We introduce Duncode, an encoding method that aims to encode the entire Unicode character set with space efficiency comparable to that of local encoders. It can pack multiple characters of a string into a single Duncode unit that occupies fewer bytes. Although it carries less self-synchronizing identification information than UTF-8, Duncode surpasses UTF-8 in space efficiency. The application is available at \url{https://github.com/laohur/duncode}. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It covers 179 languages and can be accessed at \url{https://github.com/laohur/wiki2txt}.
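The trade-off between local and universal encoders described above can be illustrated with a minimal sketch that uses only Python's built-in codecs. It does not implement Duncode, and the helper names `encoded_length` and `utf8_char_starts` are hypothetical, introduced here for illustration only. The first helper counts the bytes each encoder needs for a character; the second shows what self-synchronization means in UTF-8, where every continuation byte has the bit pattern `10xxxxxx`, so character boundaries can be recovered from any position in the byte stream.

```python
def encoded_length(text, codec):
    """Return how many bytes `codec` needs for `text`, or None if it cannot encode it."""
    try:
        return len(text.encode(codec))
    except UnicodeEncodeError:
        return None


def utf8_char_starts(data):
    """Return the indices at which a UTF-8 character begins.

    Continuation bytes look like 10xxxxxx, so any byte that does not match
    that pattern starts a new character. This is the self-synchronizing
    property the abstract refers to.
    """
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


if __name__ == "__main__":
    # "utf-16-le" is used so that no byte-order mark is counted.
    codecs_to_compare = ["ascii", "gb2312", "utf-8", "utf-16-le"]
    for char in ["a", "中"]:
        sizes = {c: encoded_length(char, c) for c in codecs_to_compare}
        print(repr(char), sizes)

    print(utf8_char_starts("a中b".encode("utf-8")))
```

In this sketch, the Chinese character takes two bytes in the local GB-2312 encoder and three bytes in UTF-8, while ASCII cannot represent it at all, which is the space gap a universal encoder with local-encoder efficiency would aim to close.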