Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences
Olivier Pietquin, Matthieu Geist and Senthilkumar Chandramohan
Designing dialogue policies for voice-enabled interfaces is a tailoring job that is most often left to natural language processing experts. This job is generally redone for every new dialogue task because cross-domain transfer is not possible. For this reason, machine learning methods for dialogue policy optimization have been investigated during the last 15 years. In particular, reinforcement learning is now part of the state of the art in this domain. Yet, it is very rare to find sample-efficient and off-policy algorithms designed to meet the specific requirements of this task, even though data collection and annotation are time-consuming and expensive. To alleviate the data sparsity problem, research in this domain relies on artificial data expansion methods (bootstrapping) based on simulated dialogues and user models. In this contribution, a \emph{sample-efficient}, \emph{online} and \emph{off-policy} reinforcement learning algorithm is proposed to learn an optimal policy from a few hundred dialogues generated with a very simple handcrafted policy, showing that bootstrapping may be avoided.
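To make the abstract's central idea concrete, the following is a minimal illustrative sketch of Kalman Temporal Differences for policy evaluation with a linear value function $V(s) = \phi(s)^\top \theta$: the parameter vector is treated as the hidden state of a Kalman filter and the observed reward as a noisy measurement of the temporal difference. This is a simplified assumption-laden sketch (the full KTD framework also handles nonlinear parametrizations via the unscented transform); the toy chain MDP and all hyperparameter values below are hypothetical.

```python
import numpy as np

def ktd_v(transitions, n_features, gamma=0.9, n_passes=200,
          prior_var=10.0, process_noise=1e-3, obs_noise=1.0):
    """Sketch of KTD for value evaluation with linear features.

    Each transition is (phi(s), r, phi(s')); for a terminal successor
    phi(s') is the zero vector. Hyperparameters are illustrative.
    """
    theta = np.zeros(n_features)          # posterior mean of parameters
    P = prior_var * np.eye(n_features)    # posterior covariance
    for _ in range(n_passes):
        for phi_s, r, phi_sp in transitions:
            # Prediction step: random-walk evolution model on theta.
            P = P + process_noise * np.eye(n_features)
            # Observation model: r ~ (phi(s) - gamma*phi(s'))^T theta + noise.
            h = phi_s - gamma * phi_sp
            innovation = r - h @ theta    # the TD error
            s = h @ P @ h + obs_noise     # innovation variance (scalar)
            K = P @ h / s                 # Kalman gain
            # Correction step.
            theta = theta + K * innovation
            P = P - np.outer(K, h @ P)
    return theta

# Hypothetical 3-state deterministic chain s0 -> s1 -> s2 -> terminal,
# reward 1 per transition; with gamma=0.9 the true values are
# V = (2.71, 1.9, 1.0).
def one_hot(i, n=3):
    v = np.zeros(n)
    v[i] = 1.0
    return v

transitions = [(one_hot(0), 1.0, one_hot(1)),
               (one_hot(1), 1.0, one_hot(2)),
               (one_hot(2), 1.0, np.zeros(3))]
theta = ktd_v(transitions, n_features=3)
```

Because each transition is processed as a single scalar observation, the update is fully online, and nothing in it assumes the transitions were generated by the policy being evaluated, which is what makes the second-order, off-policy use advertised in the abstract possible.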