Most learning algorithms with formal regret guarantees assume that no mistake is irreparable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are \emph{catastrophic}, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe in that round, and we aim to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that, in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online learning model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.