In this paper, we study the low-rank structure of the reward sequence in pure exploration problems. First, we propose the separated setting for pure exploration, in which the exploration strategy receives no feedback from its explorations. Because of this separation, the exploration strategy must sample the arms obliviously. By exploiting the kernel information of the reward vectors, we provide efficient algorithms for both the time-varying and fixed cases, with regret bound $O(d\sqrt{(\ln N)/n})$. We then establish a lower bound for pure exploration in multi-armed bandits with a low-rank reward sequence. There is an $O(\sqrt{\ln N})$ gap between our upper bound and this lower bound.