The use of visually-rich documents (VRDs) in various fields has created a demand for Document AI models that can read and comprehend documents as humans do, which requires overcoming technical, linguistic, and cognitive barriers. Unfortunately, the lack of appropriate datasets has significantly hindered progress in the field. To address this issue, we introduce \textsc{DocTrack}, a VRD dataset genuinely aligned with human eye-movement information through eye-tracking technology. This dataset can be used to investigate the barriers mentioned above. Additionally, we explore the impact of human reading order on document understanding tasks and examine what happens when a machine reads in the same order as a human. Our results suggest that although Document AI models have made significant progress, they still have a long way to go before they can read VRDs as accurately, continuously, and flexibly as humans do. These findings have potential implications for the future research and development of Document AI models. The data is available at \url{https://github.com/hint-lab/doctrack}.