Q-learning vs. SARSA reinforcement learning algorithms
N. Mughees | August 23, 2023

When beginning to study reinforcement learning, temporal difference learning is frequently used as an entry point. Two well-known algorithms are commonly used to elaborate on this concept and demonstrate the fundamentals of reinforcement learning: Q-learning and state-action-reward-state-action (SARSA). The two appear nearly identical at first glance, and it can be difficult to appreciate the significance of the differences between them. This article explains exactly that.
What is a basic reinforcement learning problem?
An agent and an environment make up the bare minimum of a reinforcement learning problem. The agent can take actions in the environment that earn rewards. The agent is not told what to do and must learn through trial and error which actions bring in the most reward.
The problem simplifies considerably when the dynamics of the environment are already known. Since the agent does not have to explore to find the best course of action, it can apply some basic notions from dynamic programming to do so. This approach, known as model-based, has the agent make decisions based on a representation it has built of its surroundings. The challenge becomes considerably harder when the dynamics of the environment are unknown. This situation, where both exploration and exploitation are required, is described as model-free, and this is where temporal difference learning is useful.
What are the Q-table, exploration, exploitation and action selection strategies?
Grasping the distinction between SARSA and Q-learning first requires an understanding of a few essential ideas.
Q-table
The agent bases its decisions on the rewards it has received in the past. It does this by storing Q-values in a table, which records a Q-value for every state-action pair. The key distinction between SARSA and Q-learning lies in how the agent updates these Q-values. Depending on the action selection strategy, the agent can then pick the action with the largest estimated reward.
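To make this concrete, here is a minimal sketch of how a Q-table might be stored in code, using NumPy; the state and action counts below are illustrative assumptions rather than values from the article.

```python
import numpy as np

# Hypothetical sizes for illustration: a grid world with 48 states and 4 moves.
N_STATES, N_ACTIONS = 48, 4

# The Q-table holds one estimated return (Q-value) per state-action pair.
# It is typically initialized to zeros (or small random values) before learning.
Q = np.zeros((N_STATES, N_ACTIONS))

# Looking up the stored estimates for a state, and the greedy action in it:
state = 0
print(Q[state])                   # Q-values of all four actions in this state
print(int(np.argmax(Q[state])))   # index of the action with the largest estimate
```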
Exploration and exploitation
The agent must balance exploitation and exploration to arrive at a good policy. The objective of every agent is to maximize its long-term reward. Because the dynamics of the environment are unknown, however, the agent cannot predict which actions will provide the greatest reward. It can only learn this by trying options it has never picked before. There is thus a trade-off between exploring new actions and exploiting the actions it already knows to be rewarding.
Action selection strategies
The action selection strategy determines which action the agent takes. The most elementary approach is the greedy strategy, which always picks the action with the highest estimated value; that is, it always exploits the move it currently believes will net the most reward. This approach, however, can overlook better options it has never tried. The ε-greedy (epsilon-greedy) strategy is another popular option. Using the epsilon parameter, ε-greedy strikes a middle ground between exploring and exploiting: with a small probability (epsilon) it selects an action from the Q-table uniformly at random rather than always picking the one with the highest estimated value.
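As a rough sketch, an ε-greedy selector over the Q-table from the previous example might look like the following; the `epsilon_greedy` helper and the value of `epsilon` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: any action, uniformly at random
    return int(np.argmax(Q[state]))           # exploit: the best estimate so far

# Example with an all-zero Q-table (48 states, 4 actions):
Q = np.zeros((48, 4))
action = epsilon_greedy(Q, state=0, epsilon=0.1)
```

Setting epsilon to 0 recovers the purely greedy strategy; larger values spend more time exploring.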
What are on-policy and off-policy reinforcement learning algorithms?
Temporal difference reinforcement learning algorithms fall into two main types: on-policy and off-policy. An on-policy method learns the value of the same policy it uses to select actions, so the behavior policy and the target policy are the same. An off-policy method learns the value of a target policy that differs from the behavior policy it follows while acting.
What are the main differences between SARSA and Q-learning?
- Q-learning is an off-policy temporal difference method and SARSA is an on-policy learning algorithm.
- SARSA updates a Q-value using the Q-value of the subsequent state and the action the agent's policy actually takes in that state.
- In Q-learning, the Q-value is updated using the Q-value of the subsequent state and the greedy (maximum) action in that state, regardless of which action the agent actually takes next. The update target therefore follows the optimal (greedy) policy even while the agent's behavior still includes exploration (see the sketch after this list).
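The two update rules can be written side by side as a short sketch. The learning rate `ALPHA` and discount factor `GAMMA` below are illustrative assumptions; only the form of the update targets comes from the descriptions above.

```python
import numpy as np

# Illustrative hyperparameters, not values from the article.
ALPHA, GAMMA = 0.1, 0.95

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: the target uses the action the agent actually takes next."""
    target = r + GAMMA * Q[s_next, a_next]
    Q[s, a] += ALPHA * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: the target uses the greedy (maximum) action in the next state."""
    target = r + GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (target - Q[s, a])

# One illustrative transition: from state 0, taking action 1, reward -1, landing in state 4.
Q = np.zeros((48, 4))
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=4, a_next=2)
q_learning_update(Q, s=0, a=1, r=-1.0, s_next=4)
```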
Let's understand these differences using the example of walking along a cliff. Assume a reinforcement learning agent in a cliff world must make its way from the first cell to the last cell by walking along the edge of a cliff. Taking a step costs -1, while falling off the cliff costs -20. A direct course along the cliff edge is the quickest route, but it is also the riskiest: if the agent makes a mistake, it pays a large negative penalty of -20.
Although we know the quickest route, our Q-learning and SARSA agents will disagree about whether it is the best one. The on-policy SARSA agent considers the cliff edge more dangerous because it updates its Q-values based on the stochastic (ε-greedy) policy it actually follows: it has learned that occasional random steps near the edge usually end with a fall and a large penalty, so it prefers a safer path further from the cliff. Q-learning, by contrast, updates toward the greedy policy and learns the shortest path right along the edge, but while it is still exploring it may occasionally fall off the cliff.
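For readers who want to experiment, here is a rough sketch of such a cliff world trained with both update rules. The step and cliff penalties (-1 and -20) follow the example above; the 4x12 grid, learning rate, discount factor, epsilon and episode count are assumptions made for illustration.

```python
import numpy as np

# A minimal cliff world: start in the bottom-left corner, goal in the bottom-right,
# with the cliff along the bottom edge between them.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # up, down, left, right
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.95, 0.1, 500  # illustrative assumptions
rng = np.random.default_rng(0)

def step(state, action):
    """Apply a move, returning (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    r, c = min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                          # stepped off the cliff
        return START, -20.0, False
    return (r, c), -1.0, (r, c) == GOAL

def choose(Q, state):
    """Epsilon-greedy action selection over the Q-table."""
    if rng.random() < EPSILON:
        return int(rng.integers(4))
    return int(np.argmax(Q[state]))

def train(method):
    Q = np.zeros((ROWS, COLS, 4))
    for _ in range(EPISODES):
        s, a = START, choose(Q, START)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = choose(Q, s2)
            if method == "sarsa":                      # on-policy target
                target = r + GAMMA * Q[s2][a2] * (not done)
            else:                                      # off-policy (Q-learning) target
                target = r + GAMMA * Q[s2].max() * (not done)
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a = s2, a2
    return Q

Q_sarsa, Q_qlearn = train("sarsa"), train("qlearning")
```

Extracting the greedy path from each table typically shows Q-learning hugging the cliff edge while SARSA keeps a row or two of safety margin, mirroring the behavior described above.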
Conclusion
Because Q-learning updates toward the greedy (maximum) action, it can learn the optimal policy quickly and accurately. However, while it is still exploring, it can repeatedly fall off the cliff and collect poor rewards along the way. SARSA, by contrast, accounts for its exploratory steps in its updates, so it can arrive at a good solution without falling off the cliff, although the route it settles on is typically longer. The trade-off is that Q-learning may arrive at the best policy more quickly than SARSA.