Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, at resolutions as low as $8 \times 8$ pixels. Remarkably, these inputs achieve 39.2\% accuracy, comparable to the 39.1\% of the index-based baseline. Such low-resolution settings also exhibit a pronounced \emph{hot-start} effect: after only 0.4\% of total training, accuracy exceeds 12\%, while index-based models remain below 6\%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
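The core idea of replacing an embedding-table lookup with a projection of a low-resolution glyph image can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the names `render_char` and `PixelEmbedder` are hypothetical, and the glyph rasterizer is stubbed with a deterministic pseudo-bitmap so the example stays self-contained (a real pipeline would render the character with a CJK font).

```python
import numpy as np

RES = 8        # 8x8 grayscale input, as in the abstract
D_MODEL = 32   # hypothetical model width

rng = np.random.default_rng(0)

def render_char(ch: str, res: int = RES) -> np.ndarray:
    """Stand-in for rasterizing a character glyph to a res x res
    grayscale image in [0, 1]. Here we derive a deterministic
    pseudo-bitmap from the code point instead of rendering a font."""
    g = np.random.default_rng(ord(ch))
    return g.random((res, res))

class PixelEmbedder:
    """Maps a character image to a d_model vector via one linear
    projection, playing the role of the usual embedding table."""
    def __init__(self, res: int, d_model: int):
        self.W = rng.normal(0.0, 1.0 / res, size=(res * res, d_model))
        self.b = np.zeros(d_model)

    def __call__(self, img: np.ndarray) -> np.ndarray:
        # Flatten the bitmap and project: the decoder sees this vector
        # in place of a token-ID embedding.
        return img.reshape(-1) @ self.W + self.b

embed = PixelEmbedder(RES, D_MODEL)
vec = embed(render_char("漢"))
print(vec.shape)  # (32,)
```

The rest of the decoder is unchanged: because the visual front end produces a vector of the same width as an embedding lookup, it can be swapped in without touching the transformer layers.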