Tables are crucial containers of information, but understanding their meaning can be challenging. This has drawn recent attention to Semantic Table Interpretation (STI), the task of semantically annotating tabular data to disambiguate its meaning. Over the years, interest has grown in data-driven approaches based on deep learning, which are increasingly combined with heuristic-based approaches. More recently, the advent of Large Language Models (LLMs) has produced a new category of approaches for table annotation. Interest in this research field, which is characterised by multiple challenges, has led to a proliferation of approaches employing different techniques. However, these approaches have not been consistently evaluated on a common ground, which makes comparing them difficult. This work presents an extensive evaluation of four state-of-the-art (SOTA) approaches: Alligator (formerly s-elBat), Dagobah, TURL, and TableLlama. The first two belong to the family of heuristic-based algorithms, while TURL and TableLlama are encoder-only and decoder-only LLMs, respectively. The primary objective is to measure the ability of these approaches to solve the entity disambiguation task, with the ultimate aim of charting new research paths in the field.