The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed a Micro-F1 score of 0.75, whereas the strongest open-source model (Qwen-8B) achieves only around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
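For reference, the Micro-F1 metric reported above can be computed as in the following minimal sketch using scikit-learn; the label values are hypothetical placeholders, not the actual PILOT-Bench Issue Type label set, and this is an illustration of the metric rather than the paper's evaluation code.

```python
# Minimal sketch: Micro-F1 for a multi-class classification task such as
# Issue Type. Labels are hypothetical, not the PILOT-Bench label set.
from sklearn.metrics import f1_score

# Gold labels and model predictions for a handful of cases (illustrative only).
y_true = ["obviousness", "anticipation", "obviousness", "eligibility"]
y_pred = ["obviousness", "obviousness", "obviousness", "eligibility"]

# Micro-F1 pools true/false positives and negatives across all classes before
# computing F1; for single-label multi-class data it equals accuracy.
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(f"Micro-F1: {micro_f1:.2f}")  # 0.75
```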