Large language models (LLMs) are entering legal workflows, yet we lack a jurisdiction-specific framework for assessing their baseline competence in these settings. We use India's public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond multiple-choice questions, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court's Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. We find that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural or format compliance, authority or citation discipline, and forum-appropriate voice and structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.