This longitudinal pilot study tracked how generative AI reshaped problem-solving in an academic setting over six months, across three measurement waves. AI integration reached saturation by Wave 3: daily use rose from 52.4% to 95.7%, and ChatGPT adoption from 85.7% to 100%. A dominant hybrid workflow grew 2.7-fold, ultimately adopted by 39.1% of participants. A verification paradox emerged: participants relied most heavily on AI for difficult tasks (73.9%) yet reported declining verification confidence (68.1%) precisely where performance was worst (47.8% accuracy on complex tasks). Objective performance declined systematically with increasing problem difficulty, from 95.2% to 81.0% to 66.7% to 47.8%, and belief-performance gaps widened to 34.6 percentage points. These results indicate a fundamental shift: verification, not solution generation, became the bottleneck in human-AI problem-solving. The ACTIVE Framework synthesizes the findings, grounded in cognitive load theory: Awareness and task-AI alignment, Critical verification protocols, Transparent human-in-the-loop integration, Iterative skill development to counter cognitive offloading, Verification confidence calibration, and Ethical evaluation. The authors provide implementation pathways for institutions and practitioners. Key limitations include sample homogeneity (a single academic cohort recruited by convenience sampling), which limits generalizability to corporate, clinical, or regulated professional contexts; self-report bias in confidence measures (a 32.2-percentage-point divergence from objective performance); the lack of control conditions; restriction to mathematical/analytical problems; and a timeframe too short to assess long-term skill trajectories. Results generalize primarily to early-adopter, academically affiliated populations; causal validation requires randomized controlled trials.