In toxicology research, experiments are often conducted to determine the effect of toxicant exposure on the behavior of mice, where mice are randomized to receive the toxicant or not. In particular, in fixed interval experiments, a mouse is provided reinforcers (e.g., food pellets) contingent on some action it takes (e.g., pressing a lever), but the reinforcers are only provided after fixed time intervals. To analyze fixed interval experiments, one often specifies and then estimates the conditional state-action distribution (e.g., using an ANOVA). This existing approach, which in the reinforcement learning framework would be called modeling the mouse's "behavioral policy," is sensitive to misspecification, and any model for the behavioral policy is likely misspecified: the mapping from a mouse's exposure to its actions can be highly complex. In this work, we avoid specifying the behavioral policy by instead learning the mouse's reward function. Specifying a reward function is as challenging as specifying a behavioral policy, so we propose a novel approach that incorporates knowledge of the optimal behavior, which is often known to the experimenter, to avoid specifying the reward function itself. In particular, we define the reward as a divergence of the mouse's actions from optimality, where the representations of the action and of optimality can be arbitrarily complex. The parameters of the reward function then serve as a measure of the mouse's tolerance for divergence from optimality, a novel summary of the impact of the exposure. The parameter itself is scalar, and the proposed objective function is differentiable, allowing us to benefit from standard results on the consistency of parametric estimators while making very few assumptions.
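To fix ideas, here is a minimal sketch of one possible instantiation of such a reward (a hypothetical parametrization for illustration only; the symbols $D$, $\phi$, and $\tau$ are ours and need not match the paper's exact form): let $a_t$ denote the mouse's action at time $t$, let $a_t^\star$ denote the optimal action known to the experimenter, and let $\phi$ be an arbitrarily complex representation of actions. One could then define

\[
r_\tau(a_t) \;=\; -\frac{1}{\tau}\, D\!\big(\phi(a_t),\, \phi(a_t^\star)\big), \qquad \tau > 0,
\]

where $D$ is a divergence between representations and the scalar $\tau$ plays the role of the tolerance parameter: larger values of $\tau$ down-weight a given divergence from optimality, corresponding to greater tolerance. With $D$ and $\phi$ fixed, $r_\tau$ is differentiable in the single scalar $\tau$, which is consistent with the abstract's claim that standard consistency results for parametric estimators apply under few assumptions.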