CARE: A Capability-Based Measurement Framework for Reproductive Equity in Human-AI Interaction

Algorithmic systems mediate how people seek sexual and reproductive health (SRH) information. Standard HCI and AI evaluation centers on usability, accuracy, and interaction quality, measures designed to assess performance at the system level. We introduce CARE, the Capability Approach for Reproductive Equity, a measurement framework for human-AI interaction that adds capability outcomes as a unit of evaluation above task performance. CARE has two parts. The Normative Design Lens identifies the resources, conversion factors, capabilities, and functionings a system should support. The Evaluation Lens assesses how design features, interaction patterns, and social conditions shape capability outcomes, tradeoffs, and lived experiences in use. We apply CARE to SRH-specific chatbots, general-purpose LLMs, and search engine features in a study with 12 participants, showing that it surfaces capability outcomes that standard metrics aggregate away. The same design features expanded capabilities for some users while constraining them for others: source-level organization, response format, tone, and SRH-specific features all shaped which capabilities changed, for which users, and in which direction. Participants' professional backgrounds, gender identities, and prior familiarity with AI further shaped these effects, producing capability outcomes that usability and accuracy metrics, aggregated across users, would not surface. These findings establish capability outcomes as a measurable unit of human-AI interaction evaluation, extending existing metrics with a capability layer above task performance.
