UTF-16 is a widely used Unicode encoding that represents each character with one or two 16-bit code units. Characters beyond the Basic Multilingual Plane are encoded as surrogate pairs: a high surrogate immediately followed by a low surrogate. Ill-formed UTF-16 strings, in which a surrogate appears without its matching partner, can arise from data corruption or improper encoding, and they pose security and reliability risks. Consequently, programming languages such as JavaScript include functions that fix ill-formed UTF-16 strings by replacing each unpaired surrogate with the Unicode replacement character (U+FFFD). We propose using Single Instruction, Multiple Data (SIMD) instructions to process multiple code units in parallel, which makes this repair faster and more efficient. Our software is part of the Google JavaScript engine (V8) and is therefore included in several major Web browsers.
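To make the repair rule concrete, the following is a minimal scalar sketch of it in C++. It is not the SIMD implementation proposed here and is not taken from V8; the names `to_well_formed`, `is_high_surrogate`, `is_low_surrogate`, and `kReplacement` are hypothetical and chosen only for illustration. A SIMD version would apply the same classification to many code units at once.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; none are taken from V8.
constexpr char16_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
inline bool is_high_surrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
inline bool is_low_surrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Scalar baseline: returns a copy of `in` in which every unpaired
// surrogate has been replaced by U+FFFD. Valid pairs are left untouched.
std::u16string to_well_formed(const std::u16string& in) {
  std::u16string out = in;
  const std::size_t n = in.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (is_high_surrogate(in[i])) {
      if (i + 1 < n && is_low_surrogate(in[i + 1])) {
        ++i;  // well-formed pair: skip over the low surrogate
      } else {
        out[i] = kReplacement;  // high surrogate with no low surrogate after it
      }
    } else if (is_low_surrogate(in[i])) {
      out[i] = kReplacement;  // low surrogate not preceded by a high surrogate
    }
  }
  return out;
}
```

Because every replacement substitutes one 16-bit code unit with another, the output has the same length as the input, so the fix can be done in place or with a single pre-sized buffer.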