Bayesian Policy Search with Policy Priors
David Wingate, Leslie Kaelbling, Dan Roy, Noah Goodman and Joshua Tenenbaum
We consider the problem of learning to act in partially observable, continuous-state-and-action worlds where we have abstract prior knowledge about the structure of the optimal policy in the form of a distribution over policies. Using ideas from planning-as-inference reductions and Bayesian unsupervised learning, we cast tempered Markov chain Monte Carlo simulation as a policy search algorithm whose bias is expressed by hierarchical models, nonparametric priors, and structured, recursive processes such as grammars over action sequences. Inference over the latent variables in these policy priors enables intra- and intertask transfer of abstract knowledge. We demonstrate the flexibility of this approach on a series of environments of increasing difficulty.
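To make the core reduction concrete, the sketch below illustrates MCMC-as-policy-search in its simplest form: exponentiated return plays the role of a likelihood (the planning-as-inference idea), an isotropic Gaussian prior over policy parameters stands in for the structured policy priors discussed in the paper, and a tempered Metropolis-Hastings chain performs the search. The toy dynamics, parameter names, and annealing schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch: tempered Metropolis-Hastings over policy parameters, where the
# target combines a Gaussian policy prior with exp(return / temperature).
# All names (rollout_return, THETA_DIM, ...) are illustrative assumptions.

rng = np.random.default_rng(0)
THETA_DIM = 2      # linear-Gaussian policy: action = theta[0]*s + theta[1] + noise
PRIOR_STD = 1.0    # std of the assumed isotropic Gaussian policy prior

def rollout_return(theta, horizon=50):
    """Return of one episode of a toy 1-D regulation task under the policy."""
    s, total = rng.normal(), 0.0
    for _ in range(horizon):
        a = theta[0] * s + theta[1] + 0.1 * rng.normal()
        s = s + a + 0.1 * rng.normal()   # simple continuous dynamics
        total += -s ** 2                 # reward: stay near the origin
    return total

def log_target(theta, temperature):
    """log prior + tempered 'reward likelihood' (Monte Carlo estimate)."""
    log_prior = -0.5 * np.sum(theta ** 2) / PRIOR_STD ** 2
    avg_return = np.mean([rollout_return(theta) for _ in range(5)])
    return log_prior + avg_return / temperature

def mcmc_policy_search(n_iters=2000, step=0.1):
    theta = rng.normal(size=THETA_DIM) * PRIOR_STD   # sample from the prior
    best, best_ret = theta.copy(), -np.inf
    for i in range(n_iters):
        temperature = max(1.0, 10.0 * (1 - i / n_iters))  # simple annealing
        proposal = theta + step * rng.normal(size=THETA_DIM)
        log_accept = log_target(proposal, temperature) - log_target(theta, temperature)
        if np.log(rng.uniform()) < log_accept:
            theta = proposal
        ret = rollout_return(theta)
        if ret > best_ret:
            best, best_ret = theta.copy(), ret
    return best, best_ret

if __name__ == "__main__":
    theta_star, ret = mcmc_policy_search()
    print("best policy parameters:", theta_star, "return:", ret)
```

Replacing the Gaussian prior in this sketch with a hierarchical or nonparametric prior (e.g., one whose latent variables are shared across tasks) is what allows inference to transfer abstract policy structure, as the abstract describes.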