Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, however, visual structure carries semantic and phonetic information that may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling: instead of token IDs, our decoder receives grayscale images of individual characters, at resolutions as low as 8×8 pixels. Remarkably, these inputs reach 39.2% accuracy, comparable to the 39.1% of an index-based baseline. This low-resource setting also exhibits a pronounced hot-start effect: after only 0.4% of total training, accuracy already exceeds 12%, while index-based models remain below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
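To make the architectural change concrete, the following is a minimal sketch of how a pixel-based input path could replace a standard embedding lookup. All names and dimensions here (`d_model`, `vocab_size`, the linear projection) are illustrative assumptions, not details from the paper; the point is only that a flattened 8×8 glyph can feed the decoder the same-shaped vector an index lookup would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
# model width d_model, character bitmaps at 8x8 grayscale.
d_model = 32
H = W = 8

# Index-based baseline: each character ID selects a learned row.
vocab_size = 100
embedding_table = rng.normal(size=(vocab_size, d_model))

def embed_by_index(char_ids):
    """Standard lookup: character ID -> learned embedding row."""
    return embedding_table[char_ids]

# Visual alternative: flatten the 8x8 glyph and project it linearly
# into the same d_model-dimensional space.
proj = rng.normal(size=(H * W, d_model)) / np.sqrt(H * W)

def embed_by_pixels(glyphs):
    """glyphs: (batch, 8, 8) grayscale in [0, 1] -> (batch, d_model)."""
    flat = glyphs.reshape(glyphs.shape[0], -1)  # (batch, 64)
    return flat @ proj

# Both paths yield identically shaped inputs for the decoder,
# so the rest of the model is unchanged.
ids = np.array([3, 17])
glyphs = rng.random((2, H, W))
assert embed_by_index(ids).shape == (2, d_model)
assert embed_by_pixels(glyphs).shape == (2, d_model)
```

In this framing, only the input projection differs between the two settings: the index path learns one free vector per character, while the pixel path is constrained to compose its representation from the glyph's visual structure.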