Abstract
We propose a new approach to the problem of searching a space of policies
for a Markov decision process (MDP) or a partially observable Markov
decision process (POMDP), given a model. Our approach is based on
the following observation: Any (PO)MDP can be transformed into an
�equivalent� POMDP in which all state transitions (given the current
state and action) are deterministic. This reduces the general problem
of policy search to one in which we need only consider POMDPs with
deterministic transitions. We give a natural way of estimating the
value of all policies in these transformed POMDPs. Policy search
is then simply performed by searching for a policy with high estimated
value. We also establish conditions under which our value estimates
will be good, recovering theoretical results similar to those of
Kearns, Mansour and Ng 7, but with �sample complexity� bounds that
have only a polynomial rather than exponential dependence on the
horizon time. Our method applies to arbitrary POMDPs, including ones
with infinite state and action spaces. We also present empirical
results for our approach on a small discrete problem, and on a complex
continuous state/continuous action problem involving learning to
ride a bicycle.
Users
Please
log in to take part in the discussion (add own reviews or comments).