Abstract
In a partial monitoring game, the learner repeatedly chooses an action, the
environment responds with an outcome, and then the learner suffers a loss and
receives a feedback signal, both of which are fixed functions of the action and
the outcome. The goal of the learner is to minimize its regret, defined as the
difference between its total cumulative loss and the total loss of the best
fixed action in hindsight.
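To make the protocol concrete, here is a minimal sketch in Python of one run of a finite partial monitoring game, together with the regret computation. The loss matrix L, the feedback matrix H, and the uniformly random learner and environment are illustrative assumptions for this sketch, not constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical game: N actions, M outcomes, T rounds.
N, M, T = 3, 4, 1000
L = rng.random((N, M))            # L[a, o]: loss of action a on outcome o
H = rng.integers(0, 5, (N, M))    # H[a, o]: feedback signal (arbitrary symbols)

total_loss = 0.0
outcome_counts = np.zeros(M)

for t in range(T):
    a = rng.integers(N)           # placeholder learner: uniform random action
    o = rng.integers(M)           # environment's outcome (here i.i.d. uniform)
    total_loss += L[a, o]         # the learner suffers the loss...
    feedback = H[a, o]            # ...but observes only the feedback signal
    outcome_counts[o] += 1

# Regret: total loss minus the loss of the best fixed action in hindsight.
best_fixed_loss = np.min(L @ outcome_counts)
regret = total_loss - best_fixed_loss
print(f"regret after T={T} rounds: {regret:.1f}")
```

A real algorithm would replace the uniform action choice with one driven only by the observed feedback signals, never the outcomes or losses themselves, which is the crux of the partial monitoring problem.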
In this paper we characterize the minimax regret of any
partial monitoring game with finitely many actions and
outcomes. It turns out that the minimax regret of any such game is either zero,
$\tilde{\Theta}(T^{1/2})$, $\Theta(T^{2/3})$, or $\Theta(T)$, where $T$ is the number of rounds. We provide computationally efficient learning
algorithms that achieve the minimax regret to within a logarithmic factor for any game. In addition to the bounds on the minimax regret, we prove individual upper bounds on the expected regret under the assumption that the outcomes are generated in an i.i.d. fashion.