How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect real-world complexities, or not released as open source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of punitive additional taxes. PLAT comprises 300 examples: (1) 100 binary-choice questions, (2) 100 multiple-choice questions, and (3) 100 essay-type questions, all derived from 100 Korean court precedents. PLAT is constructed to evaluate not only LLMs' understanding of tax law but also their performance on legal cases that require complex reasoning beyond the straightforward application of statutes. Our systematic experiments with multiple LLMs reveal that (1) their baseline capabilities are limited, especially in cases involving conflicting issues that require a comprehensive understanding not only of the statutes but also of the taxpayer's circumstances, and (2) even advanced reasoning models like o3, which actively employ inference-time scaling, struggle particularly with the Application and Conclusion ("AC") stages of the "IRAC" framework. The dataset is publicly available at: https://huggingface.co/collections/sma1-rmarud/plat-predicting-the-legitimacy-of-punitive-additional-tax