Inproceedings,

Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Y. Abbasi-Yadkori, P. Bartlett, V. Kanade, Y. Seldin, and Cs. Szepesvári.
NIPS, pages 2508--2516. (December 2013)

Abstract

We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves \sqrt{T\log|\Pi|}+\log|\Pi| regret with respect to a comparison set of policies \Pi. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set \Pi has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.

BibTeX key: AYBKSSz13
entry type: inproceedings
booktitle: NIPS
year: 2013
month: December
pages: 2508--2516
ee: http://papers.nips.cc/paper/4975-online-learning-in-markov-decision-processes-with-adversarially-chosen-transition-probability-distributions
date-added: 2013-11-29 18:59:40 +0200
pdf: papers/ChangingTransNIPS2013.pdf
date-modified: 2016-04-22 02:25:06 +0000


Cite this publication

%0 Conference Paper
%1 AYBKSSz13
%A Abbasi-Yadkori, Y.
%A Bartlett, P.
%A Kanade, V.
%A Seldin, Y.
%A Szepesvári, Cs.
%B NIPS
%D 2013
%K MDPs, adversarial finite learning learning, online reinforcement setting, theory,
%P 2508--2516
%T Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
%X We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves \sqrt{T\log|\Pi|}+\log|\Pi| regret with respect to a comparison set of policies \Pi. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set \Pi has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.

@inproceedings{AYBKSSz13,
  abstract = {We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves \sqrt{T\log|\Pi|}+\log|\Pi| regret with respect to a comparison set of policies \Pi. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set \Pi has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.},
  added-at = {2020-03-17T03:03:01.000+0100},
  author = {Abbasi-Yadkori, Y. and Bartlett, P. and Kanade, V. and Seldin, Y. and Szepesv{\'a}ri, {Cs}.},
  biburl = {https://www.bibsonomy.org/bibtex/2639e6dcadb678b28e358dfac802b1919/csaba},
  booktitle = {NIPS},
  date-added = {2013-11-29 18:59:40 +0200},
  date-modified = {2016-04-22 02:25:06 +0000},
  ee = {http://papers.nips.cc/paper/4975-online-learning-in-markov-decision-processes-with-adversarially-chosen-transition-probability-distributions},
  interhash = {1203345e58ed782cd0fe0563ac4a5f58},
  intrahash = {639e6dcadb678b28e358dfac802b1919},
  keywords = {MDPs, adversarial finite learning learning, online reinforcement setting, theory,},
  month = {December},
  pages = {2508--2516},
  pdf = {papers/ChangingTransNIPS2013.pdf},
  timestamp = {2020-03-17T03:03:01.000+0100},
  title = {Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions},
  year = 2013
}
