Thompson sampling

What is Thompson sampling in reinforcement learning?

Thompson sampling is an algorithm that uses exploration and exploitation for the purpose of choosing the actions that would maximize the rewards earned. It is also known as Probability Matching or Posterior Sampling.

It is a reinforcement learning algorithm that follows exploration and exploitation to maximize the cumulative rewards gained by performing an action.

Exploration essentially involves performing an action multiple times. Now, the results from the exploration (which can either be rewards or penalties), help in determining the actions that will be performed further with an aim to maximize the reward earned and improve future performance.

In Thompson sampling, as more information is gathered, the search is decreased. This helps us get the most information in the fewest searches possible. The algorithm is more search-oriented when there is less data available, and less search-oriented when there is more data at our disposal.

Thompson sampling is essentially a heuristic for picking actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It involves picking the action that maximizes the expected reward with regards to a randomly-drawn belief.

‍

What is the Multi-Armed Bandit Problem?

The multi-armed bandit problem, though it may get you thinking about highwaymen, actually refers to multiple slot machines with varied payback percentages or a single slot machine with multiple arms, and each arm has a different payback percentage. Why are these slot machines referred to as bandits? Well, they are extremely popular in casinos, and gamblers feed them a lot of money in the hopes of winning big money, but most of them don’t win anything - you could basically say that they pretty much got robbed by the slot machine.

The Multi-Armed Bandit problem is one of the first and the best examples to explain the Thompson Sampling method.

The idea behind the multi-armed bandit problem is simple. If there are several slot machines, you are aiming to maximize the reward for a person pulling the lever of each machine randomly.

The goal is to create a policy for picking an action (in this problem, picking an action refers to choosing an arm to pull) at every step to maximize the total reward earned at the end of a pre-decided time period.

Let’s just say that there are 5 slot machines and you have a limited number of turns for you to pull the levers. The multi-armed bandit problem focuses on how you can determine maximum success.

It is a Markov Decision process problem.

‍

How does Thompson sampling work?

Thomspon sampling is essentially an algorithm that is used for online decision problems in which the actions are carried out sequentially in a way that needs to balance between exploiting what is known to maximize immediate performance and investing in accumulating

new information that could potentially improve future performance.

Thompson sampling works by exploring to resolve uncertainty when it is possible that resolution will assist the agent in identifying the optimal action, but it will not bother probing in places where feedback would not be beneficial in any way.

What is the intuition behind Thompson sampling in the multi-armed bandit problem?

It is assumed that all the slot machines have a uniform probability of earning a reward.
Based on the reward or penalty observed from each action, a new distribution is generated with probabilities of success for each machine.
Based on the probabilities or observations from each round, further observations are made.
There will be a success distribution associated with each machine after the right amount of observations. This will assist the player in choosing the machines in the right way to earn the highest possible reward.

‍

What are the applications of Thompson sampling?

The Thompson Sampling algorithm has been available for a very long time, it was first put forward by William R. Thompson in 1933 but was completely ignored for an extremely long time. It recently gained traction in several areas of artificial intelligence and can be used for a range of purposes. It can be used to predict stock prices to some extent, based on data currently available about stock prices. It can even be used to predict the delay in traffic signals or even in OTT platforms or eCommerce portals, to help item-based recommender engines to display images related to content or products that the users will be most likely to click on and watch or purchase. It can even be used to assist in automating the transportation and delivery of items.

Thompson sampling has even found uses in revenue management, marketing, website optimization, A/B testing, advertisements, gaming, and several other domains.

Is Thompson sampling better than UCB-1?

In cases where the difference in payouts between the variants that are being tested is known to be significantly large, both the algorithms will be able to demonstrate statistically significant differences between variants within a few data points. Thompson Sampling maximizes the payout here.

Both Thompson Sampling as well as Upper Confidence Bound 1 (UCB-1) can optimize the overall payout without sacrificing exploration of all the variants and detecting statistical differences between them. UCB-1 produces allocations that have a greater level of similarity to an A/B test, while Thompson is optimized for the purpose of maximizing long-term overall payoff.

When there are minor differences in variant performance, UCB-1 tends to be more conservative than Thomspon sampling.

Essentially, Thompson Sampling is better than UCB-1 in situations in which there is a higher baseline conversion rate or higher expected effect size, where the algorithm would be more stable. UCB-1 would be a better choice of MAB algorithm for a low base conversion rate, small effect size scenarios because of its consistency, and high level of tolerance towards noise in the data.