As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- $12{,}510$ agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce \textsc{Parthenon}, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, \textsc{Parthenon} substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
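The gap between per-criterion accuracy and strict matter completion can be illustrated with a small sketch. It assumes, hypothetically, that each matter is scored against a list of pass/fail rubric criteria and that strict completion requires every criterion in a matter to pass; the function names and the toy scores below are illustrative only and are not taken from Harvey LAB or from the study's results.

```python
from typing import Dict, List


def per_criterion_accuracy(matters: Dict[str, List[bool]]) -> float:
    """Fraction of all rubric criteria passed, pooled across matters."""
    total = sum(len(criteria) for criteria in matters.values())
    passed = sum(sum(criteria) for criteria in matters.values())
    return passed / total if total else 0.0


def strict_completion_rate(matters: Dict[str, List[bool]]) -> float:
    """Fraction of matters in which every criterion passed (assumed definition)."""
    if not matters:
        return 0.0
    done = sum(1 for criteria in matters.values() if criteria and all(criteria))
    return done / len(matters)


# Hypothetical scores for two agents on the same two matters.
weaker = {
    "matter_a": [True, False, True, False],
    "matter_b": [True, True, False, False],
}
stronger = {
    "matter_a": [True, True, True, False],
    "matter_b": [True, True, True, False],
}

for name, scores in (("weaker", weaker), ("stronger", stronger)):
    print(name, per_criterion_accuracy(scores), strict_completion_rate(scores))
```

In this toy case the stronger agent passes more criteria overall, yet neither agent closes a matter, because a single missed criterion is enough to fail the strict measure.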