A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers have been extensively researched for textual and visual input, tokenization strategies for gaze data remain unexplored, owing to the nature of this data. However, a corresponding tokenization strategy would allow the vision capabilities of pre-trained MLLMs to be leveraged for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs (cf.~\cref{fig:teaser}). We evaluate the tokenizers with respect to their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy and measure its generative and predictive performance. Overall, we find that a quantile tokenizer outperforms all others in predicting gaze positions, while k-means performs best when predicting gaze velocities.
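To make the idea of a quantile tokenizer concrete, the following is a minimal sketch (not the paper's implementation): bin edges are placed at equally spaced quantiles of the training coordinates, so every token id covers an equal share of the data, and a token is decoded back to the median of its bin. All function names, the vocabulary size, and the synthetic gaze distribution are illustrative assumptions.

```python
import numpy as np

def fit_quantile_bins(values, vocab_size):
    """Return the vocab_size - 1 interior bin edges at equally spaced quantiles."""
    qs = np.linspace(0.0, 1.0, vocab_size + 1)[1:-1]
    return np.quantile(values, qs)

def encode(values, edges):
    """Map each coordinate to the index of its quantile bin (the token id)."""
    return np.searchsorted(edges, values, side="right")

def fit_decoder(values, edges, vocab_size):
    """Reconstruction table: each token decodes to the median of its training bin."""
    tokens = encode(values, edges)
    return np.array([np.median(values[tokens == t]) if np.any(tokens == t) else 0.0
                     for t in range(vocab_size)])

# Synthetic 1-D gaze x-coordinates in [0, 1] (illustrative, not a real dataset).
rng = np.random.default_rng(0)
gaze_x = rng.normal(0.5, 0.15, size=10_000).clip(0.0, 1.0)

vocab_size = 256
edges = fit_quantile_bins(gaze_x, vocab_size)
tokens = encode(gaze_x, edges)          # integer ids in [0, vocab_size)
centers = fit_decoder(gaze_x, edges, vocab_size)
recon = centers[tokens]                 # lossy reconstruction of the signal
print(np.abs(recon - gaze_x).mean())    # mean absolute reconstruction error
```

Because the bins are data-driven, dense regions of the gaze distribution receive narrow bins (low reconstruction error) while sparse tails receive wide ones, which is what makes this a natural baseline to compare against k-means codebooks.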