Multi-Armed Bandit Tutorial

An interactive demonstration of Multi-Armed Bandit algorithms, comparing their choices and rewards against your own decision-making.

Multi-Armed Bandit Problem

The Multi-Armed Bandit problem involves sequential decision-making under limited information: at each step the player chooses one of several "arms" (slot machines) and collects a reward, aiming to maximize cumulative reward by balancing exploration of poorly understood arms against exploitation of the best-known arm. Each arm has an unknown reward distribution:

\[r_i \sim \mathcal{N}(\mu_i, \sigma_i^2)\]

where \(\mu_i\) is the mean reward and \(\sigma_i^2\) is the variance for arm \(i\).
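To make the setup concrete, here is a minimal sketch of such an environment in Python with NumPy; the class name `GaussianBandit` and its interface are illustrative choices, not taken from the demo's implementation.

```python
import numpy as np

class GaussianBandit:
    """K-armed bandit with Gaussian rewards r_i ~ N(mu_i, sigma_i^2)."""

    def __init__(self, means, sigmas, rng=None):
        self.means = np.asarray(means, dtype=float)    # mu_i per arm
        self.sigmas = np.asarray(sigmas, dtype=float)  # sigma_i per arm
        self.rng = rng or np.random.default_rng()

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, arm):
        # Draw one reward from the chosen arm's distribution.
        return self.rng.normal(self.means[arm], self.sigmas[arm])
```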

Common Strategies

1. ε-Greedy:

\[a_t = \begin{cases} \arg\max_i \hat{\mu}_i & \text{with probability } 1-\varepsilon \\ \text{random arm} & \text{with probability } \varepsilon \end{cases}\]

where \(\hat{\mu}_i\) is the empirical mean reward observed for arm \(i\) so far.
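A minimal sketch of ε-greedy acting on the `GaussianBandit` environment above; the function name and default parameters are illustrative.

```python
import numpy as np

def epsilon_greedy(bandit, epsilon=0.1, n_pulls=20, rng=None):
    """Epsilon-greedy: exploit the best empirical mean, explore uniformly with prob. epsilon."""
    rng = rng or np.random.default_rng()
    counts = np.zeros(bandit.n_arms)   # n_i: pulls per arm
    means = np.zeros(bandit.n_arms)    # mu_hat_i: empirical mean reward per arm
    total = 0.0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = int(rng.integers(bandit.n_arms))      # explore: random arm
        else:
            arm = int(np.argmax(means))                 # exploit: best estimate so far
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total
```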

2. Upper Confidence Bound (UCB1):

\[a_t = \arg\max_i \left(\hat{\mu}_i + \sqrt{\frac{2\ln t}{n_i}}\right)\]

where \(t\) is the total number of pulls and \(n_i\) is the number of times arm \(i\) was pulled.
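The confidence term requires \(n_i > 0\), so implementations typically pull each arm once before applying the formula. A sketch in the same style as above (illustrative, not the demo's code):

```python
import numpy as np

def ucb1(bandit, n_pulls=20):
    """UCB1: pick the arm with the highest optimistic (mean + bonus) estimate."""
    k = bandit.n_arms
    counts = np.zeros(k)
    means = np.zeros(k)
    total = 0.0
    for t in range(1, n_pulls + 1):
        if t <= k:
            arm = t - 1                                   # initialization: pull each arm once
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)     # exploration bonus per arm
            arm = int(np.argmax(means + bonus))
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total
```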

3. Thompson Sampling:

\[\theta_i \sim \text{Beta}(\alpha_i, \beta_i), \qquad a_t = \arg\max_i \theta_i\]

where \(\text{Beta}(\alpha_i, \beta_i)\) is the posterior over arm \(i\)'s success probability. This form assumes Bernoulli (0/1) rewards; for the Gaussian rewards defined above, the analogous approach samples each \(\theta_i\) from a Gaussian posterior over \(\mu_i\).
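Below is a sketch of Thompson Sampling adapted to the Gaussian-reward environment above, using a Normal posterior with a known noise scale; the prior (a zero-mean Normal over each \(\mu_i\)) is an assumption made for illustration, not part of the tutorial.

```python
import numpy as np

def thompson_gaussian(bandit, n_pulls=20, sigma=0.5, rng=None):
    """Thompson Sampling for Gaussian rewards with known noise sigma.

    Under a N(0, sigma^2) prior on mu_i and N(mu_i, sigma^2) likelihood, the
    posterior after n_i pulls with reward sum S_i is N(S_i/(n_i+1), sigma^2/(n_i+1)).
    """
    rng = rng or np.random.default_rng()
    k = bandit.n_arms
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for _ in range(n_pulls):
        post_mean = sums / (counts + 1)            # shrinks toward the prior mean 0
        post_std = sigma / np.sqrt(counts + 1)
        theta = rng.normal(post_mean, post_std)    # one posterior sample per arm
        arm = int(np.argmax(theta))
        reward = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total
```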
[Interactive demo: pull arms yourself or let the Random, ε-Greedy, UCB1, and Thompson Sampling agents play. Panels show the experiment parameters, your performance (total reward, average reward, best arm found), and each algorithm's accumulated reward.]
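For readers without access to the interactive page, a rough offline analogue can be assembled from the sketches above. The parameter values used here (5 arms, ε = 0.1, noise σ = 0.5, 20 pulls) mirror the numbers shown in the demo's parameter panel, but their exact mapping to those fields is an assumption.

```python
import numpy as np

def random_policy(bandit, n_pulls=20, rng=None):
    """Baseline: pull a uniformly random arm each round."""
    rng = rng or np.random.default_rng()
    return sum(bandit.pull(int(rng.integers(bandit.n_arms))) for _ in range(n_pulls))

# Assumed parameters: 5 arms, epsilon = 0.1, reward noise sigma = 0.5, 20 pulls.
rng = np.random.default_rng(0)
bandit = GaussianBandit(means=rng.normal(0.0, 1.0, size=5),
                        sigmas=np.full(5, 0.5), rng=rng)

results = {
    "Random": random_policy(bandit, n_pulls=20, rng=rng),
    "ε-Greedy": epsilon_greedy(bandit, epsilon=0.1, n_pulls=20, rng=rng),
    "UCB1": ucb1(bandit, n_pulls=20),
    "Thompson": thompson_gaussian(bandit, n_pulls=20, sigma=0.5, rng=rng),
}
for name, total in results.items():
    print(f"{name:10s} total reward = {total:.2f}")
```

With only 20 pulls the totals vary a lot from run to run, which is exactly the point of the interactive comparison: over short horizons a human or a random baseline can get lucky, while the algorithms' advantage shows up as the number of pulls grows.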