Exploration and exploitation in a non-stationary environment Seminar

Time: 15:00 - 16:00
Date: 31 January 2018
Venue: The Ketley Room (B54 - level 4)

Event details

We describe the problem of selecting the best option amongst a range of sub-optimal candidates, with the aim of maximizing long-term rewards. As we have no information about the performance of the different options before the start of the experiment, the algorithm is designed to find the right trade-off between exploration and exploitation in the framework of online learning. While we focus on an application from an online travel agency, online learning has applications ranging from clinical trials to advertising and website optimization. In this particular study, an additional difficulty is seasonality in the demand. This complication is very common in areas such as sales and marketing, where the demand for particular products varies across seasons and periods of time. We use Thompson Sampling, a Multi-Armed Bandit algorithm based on Bayesian updating, which uses the new information obtained at each time step to decide which “arm” to play next, where arms are synonymous with options. Thompson Sampling was chosen for its very good empirical performance. To tackle the challenge of non-stationarity, we adjust the initial formulation of the algorithm to first capture seasonality from the data and then use it as contextual information when applying Thompson Sampling, developing a so-called “contextual bandit”. The contextual approach is often used by organisations that aim to tackle decision making in the face of uncertainty by using side information about their clients and the options they are comparing.
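
As a rough illustration of the idea in the abstract, the sketch below shows Thompson Sampling with binary rewards, where the season is used as context by keeping a separate Beta posterior for each (season, arm) pair. It is not the speaker's method: the class name, the simulate helper, the season labels and the success rates are all hypothetical, and the season is assumed to be known at each step rather than learned from the data.

```python
import random


class SeasonalThompsonSampler:
    """Beta-Bernoulli Thompson Sampling with one posterior per (season, arm).

    Hypothetical sketch: the season label is assumed to be observed
    before each decision and acts as the context.
    """

    def __init__(self, n_arms, seasons, seed=None):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
        # Beta(1, 1) uniform prior for every arm in every season,
        # stored as [alpha, beta] = [successes + 1, failures + 1].
        self.params = {s: [[1, 1] for _ in range(n_arms)] for s in seasons}

    def select_arm(self, season):
        # Draw one sample from each arm's posterior and play the largest.
        samples = [self.rng.betavariate(a, b) for a, b in self.params[season]]
        return max(range(self.n_arms), key=samples.__getitem__)

    def update(self, season, arm, reward):
        # Bayesian update for a binary reward (1 = success, 0 = failure).
        if reward:
            self.params[season][arm][0] += 1
        else:
            self.params[season][arm][1] += 1


def simulate(true_rates, n_rounds=4000, seed=0):
    """Run the sampler against made-up success rates; return pull counts."""
    seasons = list(true_rates)
    n_arms = len(true_rates[seasons[0]])
    sampler = SeasonalThompsonSampler(n_arms, seasons, seed=seed)
    env = random.Random(seed + 1)  # separate generator for the environment
    pulls = {s: [0] * n_arms for s in seasons}
    for t in range(n_rounds):
        season = seasons[t % len(seasons)]  # seasons alternate each round
        arm = sampler.select_arm(season)
        reward = 1 if env.random() < true_rates[season][arm] else 0
        sampler.update(season, arm, reward)
        pulls[season][arm] += 1
    return pulls


if __name__ == "__main__":
    # Illustrative rates only: the best arm differs between seasons.
    rates = {"summer": [0.8, 0.2], "winter": [0.2, 0.8]}
    print(simulate(rates))
```

Keeping an independent posterior per season is the simplest way to use seasonality as context; it ignores any structure shared between seasons, which a richer contextual model could exploit.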

Speaker information

Andria Ellina, Postgraduate Research Student of Christine Currie