Eye-tracking-while-reading corpora are a valuable resource for many disciplines and use cases, ranging from the study of the cognitive processes underlying reading to machine-learning-based applications such as gaze-based assessment of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets, as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. Because the data are spread across disciplines and there are no common data sharing standards across communities, many existing datasets cannot easily be reused due to a lack of interoperability. In this work, we aim to create more transparency and clarity regarding existing datasets and their features across disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online at https://dili-lab.github.io/datasets.html, which presents over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements, which offers an eye-tracking datasets library. By doing so, we aim to strengthen adherence to the FAIR principles in eye-tracking-while-reading research and to promote good scientific practices, such as reproducing and replicating studies.