While table understanding increasingly relies on pixel-only settings, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Moreover, existing visual table understanding (VTU) datasets offer fixed examples with a single visualization and pre-defined instructions, providing no access to the underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 21 tasks, grounded in 2 million unique tables, 88% of which preserve their original visualizations. To evaluate whether models can jointly reason over tabular structure and visual content, we also introduce VisualTableQA, a benchmark requiring both visual perception and table understanding. Fine-tuning vision-language models such as Qwen2.5-VL-7B and Gemma 3-4B on TABLET improves performance on seen and unseen VTU tasks while improving robustness to real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.