Exploration vs. Exploitation in Reinforcement Learning



The last five years have seen many new developments in reinforcement learning (RL), a very interesting sub-field of machine learning (ML). Publication of "Deep Q-Networks" from DeepMind, in particular, ushered in a new era. As RL comes into its own, it's becoming clear that a key concept in all RL algorithms is the tradeoff between exploration and exploitation. In this post, we will simulate a problem called the "multi-armed bandit" in order to understand the details of this tradeoff.

Goals of this post

  • Share some useful RL resources
  • Simulate a K-Armed bandit problem
  • Understand the tradeoff between Exploration and Exploitation in RL


There are several great resources on RL. Below are some of the best ones I've found for practitioners like myself. These are good starting points for understanding the foundations and learning by doing.

  • Richard Sutton and Andrew Barto's Reinforcement Learning: An Introduction (second edition) is beautifully written. I really like how they provide intuitive explanation of algorithms and pseudo-code. The latter is proving to be invaluable when I want to code up an algorithm and understand the details.
  • Andrej Karpathy wrote an excellent post almost two years ago. It's a great general introduction and also a good starting point for a type of RL algorithms called policy gradient methods.
  • Finally, course CS 294-112 from UC Berkeley (instructor: Sergey Levine) is very recent, with full lecture notes and videos available. I have only reviewed one lecture so far, but it looks very promising. 


  1. All of the code for this post can be found here: https://github.com/manifoldai/reinforcement_learning_public
  2. Jupyter notebook corresponding to this blog post can be found here: https://github.com/manifoldai/reinforcement_learning_public/blob/master/notebooks/Multi-armed%20bandit.ipynb
  3. A properly rendered version of the notebook with plots can be found here:https://nbviewer.jupyter.org/github/manifoldai/reinforcement_learning_public/blob/master/notebooks/Multi-armed%20bandit.ipynb

You may find Introduction to Reinforcement Learning and OpenAI Gym by Justin Francis helpful in understanding the code.

We will be using this library that is built on top of OpenAI Gym to simulate a 10-armed bandit problem.

General RL problem

Before we look at the multi-armed bandit problem, let's take a quick look at the general RL problem setting., captured in the picture below. There are two entities: agent and environment. At time t, the Agent observes state St from the environment and also receives a reward Rt. The agent then takes an action At. In response to action At, the environment provides the next state and reward pair, and the process continues. This setup represents what is called a Markov decision process. The goal of the agent is to maximize the cumulative reward it receives from the environment.


The most distinguishing feature of RL compared to supervised learning is that there are no labels associated with actions; there is only reward for each action taken.

Multi-armed bandit problem

Multi-armed bandit problem is a simple RL problem. At every time step, the agent can choose one of K actions. The agent then receives a reward that is drawn from an unknown (to the agent) probability distribution corresponding to the said action. The goal of the agent is to choose actions such that the total reward received within a certain number of time steps is maximized. The environment's state remains unchanged for all time steps. This simplifies the the problem considerably, and makes the successive time steps IID. This can be represented as shown below.


Where π΄π‘‘βˆˆ1,2,3....𝐾 and the reward Rt∼N(ΞΌk,Οƒ2) where k is the action taken.

We can estimate the reward distribution for each action by simulating an agent that takes random actions at every time step. Our simulation runs for 3000 steps. We can see below that our random agent chose the actions approximately evenly. 

grpby_count_actionWe can estimate the densities of the 10 distributions from this simulation data.


As we can see from the plot, if our agent selects the action that corresponds to the right-most distribution, we could maximize our reward. But instead of playing a for a large number of time steps to estimate the distributions, it would be good to have our agent:

  1. Continuously estimate the average reward for each action.
  2. Take actions based on these estimates according to some policy.

A class of methods in RL called action-value methods do exactly this.

Action-Value methods

The K-armed bandit problem can be solved using a Tabular method. It is quite simple: We maintain a table of actions and their estimated expected reward. The agent then uses this table to take actions based on some policy.

At the time step t, the estimate of expected reward for taking action a, π‘„𝑑(π‘Ž) = (sum of rewards when action a is taken until t-1) / number of times a is taken

One of the simplest policies is the greedy policy, where the agent always chooses the action with the maximum expected return.


Another approach is called epsilon-greedy policy, which takes action using the greedy policy with a probability of 1βˆ’πœ– and a random action with a probability of πœ–Ο΅. This approach ensures all the action space is explored.

We can define yet another policy, called decaying-epsilon-greedy method, where we slowly decay the epsilon overtime. 

Code found here trains three agents with these three policies. A snippet is shown below. 


Exploration vs Exploitation tradeoff

Through simulation we can understand how these three algorithms explore the action space and exploit the knowledge at any given time step. This tradeoff is fundamental to many RL algorithms. So it will be informative to look closely at the concept through this 10-armed bandit problem.


As we can see, the greedy policy explored very little, and settled on choosing action 5 very quickly. The epsilon-greedy and decaying-epsilon-greedy algorithms found the optimal action (action 7, in this case) early, but continued to explore. However, the decaying-epsilon-greedy algorithm explores less and less over time. By time step 3000, it is choosing the optimal action almost all the time.


Above, we can see the average rewards for the three policies over time. The greedy algorithm clearly converged to a sub-optimal point. The epsilon-greedy and decaying-epsilon-greedy algorithms converged to the optimal action (7 in this example). However, the epsilon greedy algorithm continues to pay the price of exploration, and therefore never catches up to the performance of the decaying-epsilon-greedy algorithm.

Note that due to randomness, the results may be different in another run. For example, just by pure luck, the greedy algorithm can end up selecting the optimal action from the very beginning. In that case, the other two algorithms will never match the performance of the greedy algorithm. But counting on luck is a bad strategyβ€”and over a long number of runs, it will be sub-optimal strategy.

Let's look at results from another run.



ere we see that the greedy algorithm converged to a sub-optimal action with very little exploration. The other two algorithms converged to the optimal action (action 7). The epsilon-greedy algorithm performed better initially between those two, but it lost to the decaying-epsilon-policy around time step 792. This is due to the fact that the epsilon-greedy algorithm continues to pay the cost of exploration, while the decaying-epsilon-greedy policy reduces this cost over time.


In this post we looked at the multi-armed bandit problem in RL, and simulated three different approaches to solve it. We studied the nuances between them using simulations. Specifically, we looked at the tradeoff between exploration and exploitation, which is a fundamental concept in RL.

Rajendra Koppula, VP of Machine Learning
Prior to Manifold, Rajendra spent nearly a decade at Qualcomm working with engineering teams developing statistical and wireless signal processing solutions. Code developed by his team now runs on millions of mobile devices worldwide.