We present Cogniscope, an open evaluation framework for studying longitudinal early-risk AI systems under controlled behavioral drift, sparse observations, delayed evidence, and heterogeneous progression patterns. Cogniscope combines two complementary components: a synthetic simulation engine that generates privacy-preserving longitudinal behavioral traces aligned with configurable latent risk trajectories, and a browser-based data-collection instrument implemented as a Chrome extension for capturing naturalistic video interaction telemetry and micro-question responses during YouTube playback. The released benchmark includes 200,000 simulated video-interaction records from 200 users over 200 days, a 504-session schema-aligned synthetic deployment dataset across nine behavioral profiles, an 18-table relational schema, baseline evaluation scripts, and time-aware metrics including Early Risk Detection Error (ERDE) and time-to-detection (TTD). We emphasize that Cogniscope is not a diagnostic system and does not claim clinical validity. Instead, it provides a reusable testbed for evaluating how sequential models behave under known longitudinal challenges before deployment with real human-subject data. Experiments show that simple behavioral coherence signals separate simulated risk states under controlled priors, while rule-based deployment-profile classification remains challenging, motivating learned temporal models and robust evaluation protocols.
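To make the two time-aware metrics named above concrete, the following is a minimal sketch of how ERDE and TTD could be computed for a single simulated user. It assumes the formulation of ERDE commonly used in the early-risk-detection literature, in which a true positive is charged a sigmoid latency cost that grows with the number of observations seen before the alert. The abstract does not state the exact definition or cost settings used in Cogniscope, so the function names, the cost defaults (`c_fp`, `c_fn`, `c_tp`), and the 1-based step convention are illustrative assumptions and not the released evaluation scripts.

```python
import math
from typing import Optional, Sequence


def latency_cost(k: int, o: int) -> float:
    """Sigmoid latency cost lc_o(k) = 1 - 1 / (1 + exp(k - o)).

    k is the number of observations seen before the decision,
    o is the deadline parameter at which the cost reaches 0.5.
    """
    x = k - o
    if x > 700:  # avoid overflow in math.exp for very late decisions
        return 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(x))


def erde(decision: bool, truth: bool, k: int, o: int,
         c_fp: float = 0.1, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """ERDE for one user; the cost defaults are assumed, not taken from the paper."""
    if decision and not truth:
        return c_fp                       # false positive
    if not decision and truth:
        return c_fn                       # false negative
    if decision and truth:
        return latency_cost(k, o) * c_tp  # true positive, penalised for delay
    return 0.0                            # true negative


def time_to_detection(alerts: Sequence[bool]) -> Optional[int]:
    """1-based index of the first alert, or None if no alert was ever raised."""
    for step, alert in enumerate(alerts, start=1):
        if alert:
            return step
    return None


if __name__ == "__main__":
    alerts = [False, False, True, True]   # hypothetical per-step model alerts
    k = time_to_detection(alerts)
    print("TTD:", k)
    print("ERDE_5:", erde(decision=k is not None, truth=True, k=k or len(alerts), o=5))
```

Under this formulation a correct alert raised well before the deadline `o` costs almost nothing, while a correct alert raised long after it costs nearly as much as a miss, which is what makes the metric sensitive to delayed evidence.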