Access to justice remains limited for many people, leading laypeople to rely increasingly on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in a fully zero-shot setting. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models' proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks of laypeople relying on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.