I started re-reading this book (Drive – Daniel Pink) recently on what really motivates people and how our paradigms of learning, fulfillment and intrinsic rewards are changing rapidly. It reminded me of reinforcement learning (in the context of ML and AI), a concept I have attempted to deconstruct below.
Machine Learning (ML) focuses on creating intelligent programs or ‘agents’ through the process of ‘learning’.
What is Reinforcement Learning?
- Reinforcement Learning (RL) is one approach to ML in which the agent learns by interacting with its environment and observing the results of those interactions.
- This essentially mimics how humans learn [side note – quoting Alan Turing – “Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child’s?”]
- Here, the algorithm allows a software agent to automatically determine the ‘ideal behavior’ within a specific context, the behavior that ‘maximizes its performance’ – “Reinforcement learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward.” [Ref]
- No explicit instructions – the algorithms aren’t told which actions achieve the goal and end up learning them through trial and error (much like how we, as humans, learn)
- Let’s talk examples and use cases – reinforcement systems are easiest to understand through games, which come with a clear objective that is awarded points. Say a mouse is trying to reach a piece of cheese at the end of a maze (+1000 points) and gets an intermediate reward of +10 points along the way (for, say, water). There is also negative reinforcement in the form of electric shocks (-100 points). After exploring around a bit, the mouse may find a bunch of smaller rewards in the form of water – but if it sticks around there, it may miss the bigger reward of the cheese at the end. It thus has to decide on a trade-off between exploration and exploitation (much like decisions we puzzle over). One strategy is to take the best-known action for a majority of the time (80%+) but occasionally explore newer options or directions that may result in an unknown reward (even if it means walking away from the known and available reward, i.e. the water) [here is a paper from UCL related to ‘exploitation vs exploration’, but that deep-dive is for another article]. This is referred to as the ‘epsilon-greedy strategy’. Epsilon is the percentage of time that the agent takes a randomly selected action rather than the action most likely to maximize the reward given what it knows so far (the 20% minority route above). We usually start with a high value of epsilon (i.e. a large fraction of the time is spent exploring the unknown), and as the mouse (or we) learn which routes lead to the long-term reward, epsilon is reduced steadily (i.e. the known routes are preferred) until it reaches 10% or less, mostly exploiting what it already knows. A minimal code sketch of this strategy follows this list.
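Here is a minimal Python sketch of the epsilon-greedy idea above, framed as a simple bandit-style choice between the maze’s payoffs. The action names, reward noise, and decay schedule are my own illustrative assumptions, not part of the mouse example itself.

```python
import random

# Illustrative actions and hidden payoffs, loosely mirroring water/cheese/shock.
ACTIONS = ["water", "cheese_path", "shock_path"]
TRUE_REWARD = {"water": 10, "cheese_path": 1000, "shock_path": -100}

q_values = {a: 0.0 for a in ACTIONS}   # agent's current estimate of each action's value
counts = {a: 0 for a in ACTIONS}

epsilon, epsilon_min, decay = 1.0, 0.1, 0.995   # start exploring, settle into exploiting

for episode in range(2000):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)                   # explore a random action
    else:
        action = max(ACTIONS, key=lambda a: q_values[a])  # exploit the best-known action
    reward = TRUE_REWARD[action] + random.gauss(0, 5)     # noisy observed reward
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]  # running average
    epsilon = max(epsilon_min, epsilon * decay)           # decay epsilon over time

print(q_values)   # the estimate for "cheese_path" should end up dominating
```

After enough episodes the agent’s estimate for the cheese path dominates, which is exactly the shift from exploration to exploitation described above.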
To summarize, RL is a ‘goal oriented’ algorithm where the learning happens not from a training set but from the agent’s interaction with the environment.
RL is considered to be the hope and driving force behind true AI given its immense learning potential.
Here is an immensely astounding (and slightly alarming) video of Google’s DeepMind AI that taught itself how to walk without any prior guidance.
So how is RL different from other Machine Learning? – RL does not start with an existing training set (as most supervised learning does); the agent learns from experience. It collects training examples as it encounters them, through trial and error (‘this action was good, this was bad’), with the aim of maximizing long-term reward. Here is a classification of the different types of ML from Analytics Vidhya.
Unlike in ‘Supervised Learning’, RL has no ‘external supervisor’ that holds knowledge of the environment and shares it with the agent to complete the task. And unlike ‘Unsupervised Learning’, RL does learn a mapping from inputs to outputs; unsupervised learning instead focuses on finding ‘underlying patterns’ in the provided training data. RL creates its own data and learns through its attempts at maximizing the end reward.
- This “cause and effect” idea can be translated into the following steps for an RL agent (a minimal code sketch of the loop follows this list):
- The agent observes an input state
- An action is determined by a decision making function (policy)
- The action is performed
- The agent receives a scalar reward or reinforcement from the environment
- Information about the reward given for that state/action pair is recorded
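To make that loop concrete, here is a minimal Python sketch of the five steps. The tiny “corridor” environment and the random policy are purely illustrative assumptions; a real agent would update its policy from the recorded experience to maximize long-term reward.

```python
import random

class CorridorEnv:
    """Illustrative environment: the agent moves along positions 0..10; reaching 10 pays +100."""
    def __init__(self):
        self.position = 0

    def observe(self):
        return self.position                           # the current input state

    def step(self, action):
        self.position = max(0, min(10, self.position + action))
        return 100 if self.position == 10 else -1      # a scalar reward from the environment

def policy(state):
    """Decision-making function (policy): here just a random step left (-1) or right (+1)."""
    return random.choice([-1, 1])

env = CorridorEnv()
experience = []                                        # recorded (state, action, reward) tuples

for t in range(50):
    state = env.observe()                              # 1. the agent observes an input state
    action = policy(state)                             # 2. the policy determines an action
    reward = env.step(action)                          # 3-4. the action is performed, a scalar reward is received
    experience.append((state, action, reward))         # 5. the state/action/reward information is recorded
```

The learning itself would happen in an extra step that uses `experience` to improve `policy`, which is the part part 2 of this post digs into.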
Where is RL used? – Candidate RL problems are those that can be solved without external supervision – complex problems with no obvious or easily programmable solution. Two main areas that current applications span are:
- Game Playing – Think Chess
- Control Problems – Say, Elevator scheduling
A deeper dive into the use cases, algorithms, and techniques is covered in part 2 of this post.
Here are some additional reading links:
- http://www.cse.unsw.edu.au/~cs9417ml/RL1/introduction.html
- https://www.techopedia.com/definition/32055/reinforcement-learning
- https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/
- https://www.dropbox.com/s/e38nil1dnl7481q/machine_learning.pdf?dl=0
- https://www.analyticsvidhya.com/blog/2016/12/getting-ready-for-ai-based-gaming-agents-overview-of-open-source-reinforcement-learning-platforms/
- http://rll.berkeley.edu/deeprlcourse-fa15/docs/2015.08.26.Lecture01Intro.pdf
- http://rll.berkeley.edu/deeprlcourse-fa15/
- https://medium.com/machine-learning-for-humans/reinforcement-learning-6eacf258b265
- https://www.kdnuggets.com/2016/05/machine-learning-key-terms-explained.html/2
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/XX.pdf