Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or ID numbers), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false-positive PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
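The abstract does not specify how the density-based segmentation works; a minimal sketch of one plausible approach, in which every function name, the token-matching regex, and the 0.3 density threshold are illustrative assumptions rather than the paper's actual method, could look like:

```python
import re

def math_density(message: str) -> float:
    """Fraction of whitespace tokens that look mathematical
    (digits, arithmetic operators, or single late-alphabet variables).
    The regex is a simplified heuristic, not the paper's definition."""
    tokens = message.split()
    if not tokens:
        return 0.0
    math_like = [t for t in tokens
                 if re.fullmatch(r"[\d.+\-*/=^()x-z]+", t)]
    return len(math_like) / len(tokens)

def segment_by_density(messages, threshold=0.3):
    """Group consecutive messages into runs of (is_math_dense, messages),
    so redaction errors can be tallied per region type."""
    segments = []
    for msg in messages:
        dense = math_density(msg) >= threshold
        if segments and segments[-1][0] == dense:
            segments[-1][1].append(msg)
        else:
            segments.append((dense, [msg]))
    return segments
```

Under this kind of segmentation, false redactions flagged by an upstream PII system can be counted separately inside math-dense and math-sparse runs, which is the comparison the abstract reports.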