Feature leakage and the identifiability of direct-dependency entropy models of neural activity

Biological neurons receive thousands of synaptic inputs on branching, electrically excitable dendrites, yet population activity is often modeled with direct input-output rules in which each input contributes independently to a scalar drive. We study what successful prediction by such models does, and does not, reveal about neural computation. For conditional maximum-entropy models that match output rates and pairwise output-input coactivities, the entropy explained by a direct model is a prediction measure under the sampled input distribution, not a mechanism-identification test. A restricted MaxEnt fit is an information projection: omitted interaction, temporal, or hidden-state terms can be absorbed into fitted first-order parameters whenever they are correlated with the included sufficient statistics. For sparse correlated binary inputs, this absorption has an explicit coskewness form. We introduce diagnostics that separate in-distribution prediction from recovery of the response rule: state reweighting that holds P(y|x) fixed while changing P(x), conditional log-odds contrasts for local additivity, and temporal leakage controls. In ground-truth simulations, purely higher-order responses can pass first-order entropy and raw coactivity tests under leakage-prone sampling, but are correctly classified after reweighting. Applied to selected, leakage-enriched local tables from CA1 hippocampal recordings, approximately half of the tables that appear first-order under empirical weights become distribution-sensitive under balanced reweighting, far above a matched additive-surrogate null. Thus, direct entropy-explained fractions and raw coactivity predictions should be interpreted as predictions under the observed state distribution, not as evidence that mechanisms outside the direct model are absent or small.
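As a minimal illustration of the leakage and reweighting ideas summarized above, the following Python sketch works through a toy case exactly, with no sampling. Everything in it is an assumption made for illustration and is not taken from the paper's simulations or data: two binary inputs, a hypothetical response rule whose log-odds contain only a second-order term (`B_TRUE`, `J_TRUE`), a sparse correlated input distribution (`p_x_empirical`), and a uniform "balanced" reweighting (`p_x_balanced`). The direct model is fit by weighted maximum likelihood, which at the optimum matches the output rate and the output-input coactivities under the chosen P(x).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# All four states of two binary inputs, ordered (0,0), (0,1), (1,0), (1,1).
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
design = np.column_stack([np.ones(len(states)), states])  # columns: 1, x1, x2

# Hypothetical response rule: log-odds with a purely second-order term.
B_TRUE, J_TRUE = -2.0, 3.0
p_y_given_x = sigmoid(B_TRUE + J_TRUE * states[:, 0] * states[:, 1])

def fit_direct_model(p_x, n_iter=50):
    """Fit logit P(y=1|x) = a + w1*x1 + w2*x2 under state weights p_x.

    Newton's method on the exact expected log-likelihood. At the optimum
    the model matches E[y], E[x1*y], and E[x2*y] under p_x.
    """
    theta = np.zeros(3)
    for _ in range(n_iter):
        q = sigmoid(design @ theta)
        grad = design.T @ (p_x * (p_y_given_x - q))
        hess = design.T @ (design * (p_x * q * (1.0 - q))[:, None])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def interaction_contrast(p):
    """Conditional log-odds contrast; zero if the rule is additive in log-odds."""
    return logit(p[3]) - logit(p[2]) - logit(p[1]) + logit(p[0])

# Same P(y|x) throughout; only the input distribution P(x) changes.
p_x_empirical = np.array([0.80, 0.05, 0.05, 0.10])  # sparse, correlated (assumed)
p_x_balanced = np.full(4, 0.25)                     # uniform over states

theta_emp = fit_direct_model(p_x_empirical)
theta_bal = fit_direct_model(p_x_balanced)
contrast = interaction_contrast(p_y_given_x)

print("first-order weights, empirical P(x):", theta_emp[1:])
print("first-order weights, balanced P(x): ", theta_bal[1:])
print("log-odds interaction contrast:      ", contrast)
```

In this toy case the fitted first-order weights are nonzero even though the response rule has no first-order log-odds terms, and they shift when P(x) is reweighted with P(y|x) held fixed. The log-odds contrast is computed from P(y|x) alone, so it recovers the interaction strength regardless of how the states are weighted.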
