How One YouTuber Trained AI to Play Video Games with Reinforcement Learning

qlawk
7 min read · Nov 16, 2023

My first Christmas present in Canada was a used Game Boy Advance and a copy of Pokemon Red. Though my understanding of video games was limited, and my English even more so, I managed to beat the game after a long year of trial and error. So when I came across Peter Whidden¹ and how he trained an AI to play Pokemon, I couldn’t help but draw parallels in how we learned. Like my earlier self, the AI program had no knowledge of the game and just pressed random buttons. However, after 20,000 games and about 5 years of simulated game time, it manages not only to catch Pokemon and battle other trainers, but even to defeat a gym leader.

Pokemon Red is a Japanese role-playing game published by Nintendo in 1996. The goal is to defeat the 8 gym leaders and the Pokemon League to become the new champion. To aid them in this quest, players catch monsters called Pokemon and use them to defeat other Pokemon trainers.

Photo by Dennis Cortés

Creating Goals

Peter used Proximal Policy Optimization² as the reinforcement learning algorithm to optimize the AI’s choices. Rather than receiving explicit instructions on which buttons to press, the program is given positive and negative feedback on overarching objectives to help it decide the best course of action. The objective rewards include (a rough sketch of how they might be combined follows the list):

  • Small rewards for exploration
  • Medium rewards for raising Pokemon levels
  • Large rewards for defeating gym leaders
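
Here, as that sketch, is one way the three tiers could be folded into a single reward signal. The names and weights below are purely illustrative and are not Peter’s actual values.

```python
# Hypothetical reward weights illustrating the three tiers above;
# the names and numbers are illustrative, not taken from Peter's code.
REWARD_WEIGHTS = {
    "exploration": 0.02,  # small: per newly discovered screen
    "levels": 0.5,        # medium: per Pokemon level gained
    "badges": 5.0,        # large: per gym badge earned
}

def step_reward(new_screens: int, levels_gained: int, badges_earned: int) -> float:
    """Combine the three objective signals into one scalar reward for the agent."""
    return (REWARD_WEIGHTS["exploration"] * new_screens
            + REWARD_WEIGHTS["levels"] * levels_gained
            + REWARD_WEIGHTS["badges"] * badges_earned)
```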

The AI would play 2-hour-long games, pressing buttons at first essentially at random, then update the learning agent to prioritize the decisions that resulted in the most reward points. After repeated sessions, the program should show improvement with each subsequent version.
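
As a minimal sketch of that loop, assuming a Gym-style wrapper around the emulator (the PokemonRedEnv class below is a hypothetical stub) and the PPO implementation from stable-baselines3:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class PokemonRedEnv(gym.Env):
    """Hypothetical stub; a real version would read frames and rewards from the emulator."""
    def __init__(self, episode_length=2 * 60 * 60 * 60):  # ~2 hours at 60 frames per second
        self.episode_length = episode_length
        self.action_space = gym.spaces.Discrete(8)  # Game Boy buttons: D-pad, A, B, Start, Select
        self.observation_space = gym.spaces.Box(0, 255, shape=(144, 160, 3), dtype=np.uint8)

    def reset(self, seed=None, options=None):
        self.steps = 0
        return np.zeros((144, 160, 3), dtype=np.uint8), {}

    def step(self, action):
        self.steps += 1
        obs = np.zeros((144, 160, 3), dtype=np.uint8)  # placeholder screen
        reward = 0.0                                    # placeholder; real reward comes from game state
        done = self.steps >= self.episode_length
        return obs, reward, done, False, {}

env = PokemonRedEnv()
model = PPO("CnnPolicy", env)           # learn a policy directly from screen pixels
model.learn(total_timesteps=1_000_000)  # repeated sessions, reinforcing high-reward decisions
```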

Version 1 of Peter’s program

Awarding Curiosity

The AI keeps a record of every unique screen it has seen, so when the current screen doesn’t match any in its record, it receives a reward for exploring a new area. By program Version 5, the AI manages to leave the starting area (pictured above), and faster than before. However, Peter comments that the “novelty seeking behaviour” that helps game progression can also act as a distraction. In the first area, Pallet Town, the AI lingers around a pond with Non-Player Characters, water, and grass rather than continuing to explore.

Peter shows an example of what counts as a unique screen

A Non-Player Character (NPC) is any game character that isn’t controlled by a player. They can take many actions, including offering gameplay tips, selling items, or battling the player.

The animation of the game environment is enough to trigger the exploration reward many times over, so the program gets distracted instead of completing its objectives. Peter mentions that during the learning process, it is much easier to change the intrinsic motivation of an AI by adjusting reward values than it is to change the more complex motivations of people. He ended up requiring a new screen to differ from previously seen ones by a few hundred pixels, so the animations no longer trigger the exploration reward passively. Modifying reward values requires the AI to restart from Version 0 as a clean slate for reproducibility. Fortunately, the solution functions as intended and the AI manages to reach the next destination, Viridian City, by Version 8.
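
As a rough illustration of that threshold-based fix (not Peter’s exact implementation; the value of 300 simply stands in for “a few hundred” pixels):

```python
import numpy as np

def exploration_reward(frame, seen_frames, pixel_threshold=300):
    """Give the small exploration reward only if the frame differs from every
    remembered frame by more than `pixel_threshold` pixels, so looping
    animations no longer count as 'new' screens."""
    for seen in seen_frames:
        changed_pixels = np.count_nonzero(np.any(frame != seen, axis=-1))
        if changed_pixels < pixel_threshold:
            return 0.0                    # too similar to a screen already seen
    seen_frames.append(frame.copy())      # genuinely new screen: remember it
    return 1.0                            # small exploration reward
```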

Encouraging Growth

However, exploration alone is not sufficient to complete the game. The battle screen does not meet the unique screen criteria, so the AI chooses to run away, since battling earns it no exploration reward. This poses an issue because the AI can’t gain the levels and stats its Pokemon need to defeat stronger trainers later in the game. Peter first tried penalizing the AI for losing Pokemon battles. Unfortunately, instead of avoiding difficult battles, the AI simply stopped pressing any buttons when it was about to lose, resulting in a standstill.

Levelling Pokemon: Each trainer can have a maximum of 6 Pokemon at once. Pokemon are awarded experience points (EXP) for defeating other Pokemon, and with enough EXP they gain levels that raise their stats. Upon reaching a certain level, most Pokemon evolve into stronger forms.

Peter shows the similar-looking battle screens. Pokemon battle to increase their levels and stats.

Instead, Peter decides to implement a medium reward based on the combined levels of all the Pokemon in the party, to encourage the program to grow stronger. With this incentive, the AI battles and catches Pokemon until they reach higher levels. It eventually makes its way to the next area, Viridian Forest, by Version 60 and wins its first trainer battle on the first attempt. The maze-like forest surprisingly doesn’t pose a challenge to the program, which settles on an interesting counter-clockwise path to navigate to the next area, Pewter City, by Version 65.
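
The level incentive described at the start of this step amounts to something like the sketch below (illustrative names and scaling; the party levels would be read from the game’s memory):

```python
def level_reward(party_levels, scale=0.5):
    """Medium reward proportional to the combined level of the party,
    e.g. party_levels = [12, 9, 5]. Illustrative sketch only."""
    return scale * sum(party_levels)
```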

Overcoming Challenges

One observation that Peter made at the midpoint of the project is that the AI surprisingly avoids Pokemon Centers, which is detrimental to progression.

A Pokemon Center (PC) is a building that restores the health of Pokemon. Pokemon faint when they lose all their hit points (HP) during a battle. When all Pokemon in the party faint, the player must restart from the last visited PC or the starting area. Players can also store their Pokemon at the Center as they can only hold a maximum of 6 at once.

Peter shows an example of a Pokemon Center.

He traces this behaviour to a single instance of a large, irregular deduction of reward points. Since the level reward was based on the sum of the party’s Pokemon levels, when the AI randomly deposited a Level 13 Pokemon at the Center, it lost an equivalent amount of reward. Peter comments that although algorithms don’t have emotions, the incident resembles “trauma”, a strong emotional response to a distressing event. The negative association from just one game had such a great impact that all future iterations of the program avoided the Pokemon Center, despite it being an integral part of game progression. Peter overcomes this by modifying the level reward so that it is granted only when total levels increase, rather than being tied to the running sum of all Pokemon levels. Afterwards, he resets the program to the first iteration again.
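
One way to express that level-reward fix, as a sketch rather than Peter’s exact code, is to reward only increases beyond the highest total seen so far, so that depositing a Pokemon never subtracts points:

```python
class LevelReward:
    """Reward only increases in total party level, so depositing a Pokemon at
    a Center (which lowers the current sum) costs nothing. Illustrative sketch."""
    def __init__(self, scale=0.5):
        self.scale = scale
        self.best_total = 0                       # highest combined party level seen so far

    def update(self, party_levels):
        total = sum(party_levels)
        gained = max(0, total - self.best_total)  # only new progress counts
        self.best_total = max(self.best_total, total)
        return self.scale * gained
```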

Pokemon Battles: Each Pokemon can have up to 4 attack or action moves, each with a type such as ‘water’, ‘fire’, ‘normal’, and so on. Certain moves are ‘super effective’ depending on the type of the opposing Pokemon. Each move can only be used a certain number of times before it needs to be restored at a Pokemon Center.

The ultimate trial for this project was defeating the Pewter City gym leader, Brock, who has stronger Pokemon than any trainer the AI had faced up to this point. The algorithm had an innate bias towards using the first selectable move, a normal type move that is ineffective against the gym leader’s sturdy rock type Pokemon. Failing to defeat Brock over and over again, the AI begins to avoid the Pokemon gym in most future iterations.

Peter’s AI catches a lucky break by using a water type move.

However, after 300 days of simulated play time, on Version 100, the program catches a lucky break and defeats Brock. In that game, the AI runs out of uses of the normal type move, so it falls back on the water type move ‘Bubble’, which is super effective against rock type Pokemon. The algorithm decides that using ‘Bubble’ as its default move is optimal, and defeats Brock more consistently in subsequent games. Though the game is far from over, Peter decides that defeating the first gym leader is an ideal milestone for reviewing the state of the project and the reinforcement learning algorithm.

Improving Algorithms

Peter suggested a few modifications to improve future runs. He decided to simplify the setup as much as possible to avoid minor complications: for example, starting the program in the laboratory where the player receives their first Pokemon instead of in the player’s house, and choosing a water type starter, which is strong against the first gym leader. Additionally, for the sake of efficiency, he made use of the PyBoy emulator, which can run games at 20x speed; one iteration takes 2 hours of game time but only about 6 minutes of real time.
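
PyBoy exposes an emulation-speed setting, so a sped-up run can be sketched roughly as follows (the ROM filename is a placeholder, and exact options vary between PyBoy versions):

```python
from pyboy import PyBoy

pyboy = PyBoy("PokemonRed.gb")       # path to the ROM; placeholder filename
pyboy.set_emulation_speed(0)         # 0 = unthrottled; a value like 20 targets ~20x speed
for _ in range(2 * 60 * 60 * 60):    # ~2 hours of game time at 60 frames per second
    pyboy.tick()                     # advance the emulator by one frame
pyboy.stop()
```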

Peter shows the pathing of early models in red, intermediate in light green, and advanced in blue.

From this project, we gain insights into how AI learns through reinforcement and how difficult it is to shape an outcome without direct intervention. Peter could influence the AI’s decisions only by modifying the reward values and the algorithm for exploration and levelling, rather than by instructing it to press specific buttons. Whether getting distracted by minor details, being lastingly affected by a distressing event, or making a great discovery by a stroke of luck, it was fascinating to see the similarities between humans and AI in how they learn.

Witnessing a fascinating project by Peter Whidden leaves me with the same thoughts I had when I finally beat Pokemon Red — I can’t wait for what comes next.

[1] Whidden, P. (2023, Oct. 8). Training AI to Play Pokemon with Reinforcement Learning. YouTube.

[2] van Heeswijk, W. (2022, Nov. 29). Proximal Policy Optimization (PPO) Explained. Medium.
