Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and to guide clients with diverse personalities effectively. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we categorize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can conduct legal consultation effectively under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and we use it to evaluate 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse precisely when clients need guidance most.