Abstract
We study the problem of online learning Markov Decision Processes
(MDPs) when both the transition distributions and loss functions
are chosen by an adversary. We present an algorithm that, under a
mixing assumption, achieves Tłog|\Pi|+łog|\Pi| regret
with respect to a comparison set of policies \Pi. The regret
is independent of the size of the state and action spaces. When
expectations over sample paths can be computed efficiently and
the comparison set \Pi has polynomial size,
this algorithm is efficient.
We also consider the episodic adversarial online shortest path
problem. Here, in each episode an adversary may choose a weighted
directed acyclic graph with an identified start and finish node. The
goal of the learning algorithm is to choose a path that minimizes
the loss while traversing from the start to finish node. At the end
of each episode the loss function (given by weights on the edges)
is revealed to the learning algorithm. The goal is to minimize regret
with respect to a fixed policy for selecting paths. This problem is
a special case of the online MDP problem.
It was shown that for randomly chosen graphs
and adversarial losses, the problem can be efficiently solved. We
show that it also can be efficiently solved for adversarial
graphs and randomly chosen losses. When both graphs and losses
are adversarially chosen, we show that designing efficient algorithms for the
adversarial online shortest path problem (and hence for the
adversarial MDP problem) is as hard as learning parity with noise, a
notoriously difficult problem that has been used to design efficient
cryptographic schemes. Finally, we present an efficient algorithm whose
regret scales linearly with the number of distinct graphs.
Users
Please
log in to take part in the discussion (add own reviews or comments).