
Google Introduces Reinforcement Learning Framework

Last week, Google released its new reinforcement learning (RL) framework, Dopamine, built on its machine learning library, TensorFlow. The search giant has open sourced the new framework on GitHub, where it is now openly available.

“Inspired by one of the main components in reward-motivated behavior in the brain and reflecting the strong historical connection between neuroscience and reinforcement learning research, this platform aims to enable the kind of speculative research that can drive radical discoveries,” Pablo Samuel Castro and Marc G. Bellemare, researchers on the Google Brain Team, wrote in a blog post. “This release also includes a set of colabs that clarify how to use our framework.”

Google’s reinforcement learning framework was built to overcome some of the challenges of existing RL frameworks, which are typically unstable, inflexible and slow to master a goal. With its new TensorFlow-based framework, the goal of the team at Google Brain was to introduce a system that provides “flexibility, stability and reproducibility” for both new and experienced RL researchers.

What is Machine Learning?

Let’s pull back for a moment and first examine what reinforcement learning is in the context of machine learning (ML). Machine learning is a form of data analysis that automates analytical model building. It is founded on the idea that systems can learn from data, spot patterns and arrive at decisions with minimal human intervention. The iterative nature of ML is important: repeated exposure to data allows machines to adapt independently, learning from previous computations to produce repeatable, reliable decisions.

Everyday examples of ML include online recommendations from streaming companies, fraud detection and autonomous cars.

Reinforcement learning is one of three types of machine learning (ML) algorithm:

(i) supervised learning – where you tell the machine what it is seeing by feeding labeled data into a neural network;

(ii) unsupervised learning – there is no pre-determined goal and the data it is fed is unlabeled; the machine unearths structural aspects of the data as it learns;

(iii) reinforcement learning – a distinct paradigm in which the model is given only partial feedback: it is never told the correct answer or action, but instead receives rewards and/or punishments that drive agents in the pursuit of certain goals.

What is the Difference between Supervised and Unsupervised Learning?

Supervised learning is used to solve challenges like customer segmentation, predicting the likelihood of purchasing decisions, churn prediction, and so on. Supervised learning techniques include linear regression, random forests and support vector machines.
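
To make the distinction concrete, here is a minimal supervised-learning sketch using scikit-learn: a linear model (logistic regression, the classification cousin of linear regression) and a random forest both learn from labeled examples. The synthetic churn-style dataset and model settings below are illustrative only, not drawn from the article.

```python
# Minimal supervised-learning sketch: models learn from labeled (X, y) pairs,
# e.g. customer features -> churned or not. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # e.g. usage and account features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic "churned" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)                # learn from the labeled data
    print(type(model).__name__, model.score(X_test, y_test))
```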

Unsupervised learning is used for tasks like facial recognition, analysis of speech patterns or language translation. Unsupervised learning techniques include clustering methods such as k-means, as well as neural networks. If you are employing unsupervised learning, you typically have a lot of data and want a machine to help accelerate the speed at which you can surface interesting patterns and arrive at a solution.
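
By contrast, a minimal unsupervised sketch hands k-means a set of unlabeled points and lets it discover the grouping on its own. Again, the data here is synthetic and purely illustrative.

```python
# Minimal unsupervised-learning sketch: k-means receives unlabeled points and
# uncovers cluster structure without being told what the groups are.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Unlabeled data drawn from three hidden groups the algorithm does not know about.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the structure k-means unearthed from raw data
```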

What is Reinforcement Learning?

Reinforcement learning (RL) is a machine learning paradigm in which the machine is given only partial feedback; it concerns how software agents should take actions in an environment in order to maximize some kind of cumulative reward. Reinforcement learning helps robots learn the essentials of longer-term reasoning from experience so that they can make complex and robust decisions.

Organizations that regularly work on large, complex problems typically use RL. Other scenarios in which it is particularly effective include dealing with large state spaces, instances in which simulations are used rather than trial and error due to scale or risk issues, and areas in which assistance is needed to augment human expertise.

RL models learn via a continuous process of receiving a reward or punishment for each action taken; this means they don’t need much preexisting data or knowledge to offer effective solutions, making this methodology particularly effective at training systems to respond to unexpected or unforeseen environments.
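
That loop can be sketched in a few lines of Python. The toy corridor environment and hyperparameters below are invented for illustration; the point is that the agent only ever sees a reward signal, never the correct action.

```python
# Toy sketch of the RL loop: a tabular Q-learning agent learns to walk right
# along a tiny corridor purely from rewards (illustrative, not Dopamine code).
import random

N_STATES, GOAL = 5, 4            # a five-cell corridor; reaching state 4 pays +1
ACTIONS = (-1, +1)               # move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0
    while state != GOAL:
        # Explore occasionally, otherwise exploit the current value estimates.
        action = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0   # the only feedback the agent gets
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```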

Reinforcement learning has shown itself to be a powerful force in the gaming world. It trained AlphaGo, which defeated world Go champions two years running, and underpins OpenAI Five, which has beaten strong human teams at Valve’s Dota 2.

When combined with deep learning (another ML technique that can help robots acquire skills from experience and excels at allowing robots to handle unstructured real-world scenarios), Google says there is “the potential to enable robots to learn continuously from their experience, allowing them to master basic sensorimotor skills using data rather than manual engineering”.

There are three kinds of reinforcement learning in play currently:

(i) End-to-end (deep) reinforcement learning – the kind of RL used at DeepMind (discussed below). This approach extends RL across the entire process, from observation to action, by using a deep network without explicitly designing the state or action space;

(ii) Inverse reinforcement learning – in which the reward function is inferred, rather than given, from the observed behavior of a human expert. The goal is to mimic observed behavior that has been deemed optimal or close to optimal;

(iii) Apprenticeship learning – an expert demonstrates the target behavior and the system attempts to recover the policy via observation of the expert.

Google’s Work in Reinforcement Learning

DeepMind’s DQN

At Google, reinforcement learning is also a central component of its AI subsidiary DeepMind’s deep Q-network (DQN). Google acquired the London-based startup for a reported £400 million in 2014, four years after its founding by Demis Hassabis, Shane Legg and Mustafa Suleyman.

As of February 2015, DeepMind’s DQN was the first deep RL system to combine deep neural networks with reinforcement learning at scale. Its prowess was demonstrated through “superhuman” mastery of a wide range of Atari 2600 games using only the raw pixels and score as inputs.

Until then, researchers had only been able to create individual algorithms capable of mastering a single specific domain. The DQN algorithm, however, demonstrated that a single end-to-end reinforcement learning agent (the DQN) could surpass the performance of a professional human games tester, as well as all previous agents, across 49 different games.
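
Conceptually, the DQN recipe is an online Q-network trained against targets produced by a periodically frozen copy of itself. The code below is a heavily simplified TensorFlow illustration, not DeepMind’s implementation: the network is a small dense model rather than the original convolutional one, and the minibatch is random data standing in for samples drawn from a replay memory.

```python
# Stripped-down sketch of the DQN idea: one network estimates Q-values from
# observations, a frozen copy provides stable targets, and updates minimise
# the temporal-difference error.
import numpy as np
import tensorflow as tf

n_actions, obs_dim = 4, 84 * 84   # e.g. a flattened Atari-style frame (illustrative)

def build_q_network():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(obs_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(n_actions),     # one Q-value per action
    ])

online_net, target_net = build_q_network(), build_q_network()
target_net.set_weights(online_net.get_weights())
optimizer, gamma = tf.keras.optimizers.Adam(1e-4), 0.99

def train_step(obs, actions, rewards, next_obs, dones):
    rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
    dones = tf.convert_to_tensor(dones, dtype=tf.float32)
    # Bellman targets come from the frozen target network, which in full DQN
    # is only synchronised with the online network every few thousand steps.
    next_q = tf.reduce_max(target_net(next_obs), axis=1)
    targets = rewards + gamma * (1.0 - dones) * next_q
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(online_net(obs) * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q))   # temporal-difference error
    grads = tape.gradient(loss, online_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_net.trainable_variables))
    return loss

# One update on a fake minibatch standing in for samples from a replay memory.
batch = (np.random.rand(32, obs_dim).astype("float32"),
         np.random.randint(0, n_actions, size=32),
         np.random.rand(32).astype("float32"),
         np.random.rand(32, obs_dim).astype("float32"),
         np.zeros(32, dtype="float32"))
print(float(train_step(*batch)))
```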

DeepMind describes DQN as “the first demonstration of a general-purpose agent that is able to continually adapt its behavior without any human intervention, a major technical step forward in the quest for general AI”.

More recent DeepMind projects include AlphaGo and AlphaGo Zero which, like OpenAI’s separate OpenAI Five system, have proven their ability to beat human experts at high-level games. DeepMind has also created the Neural Turing Machine, a neural network that can access external memory in the manner of a conventional Turing machine, leading to a machine that is able to mimic the short-term memory of the human brain.

The New Tensorflow-Based Framework: Dopamine

The new TensorFlow-based framework grew out of a range of critical advances in the field of reinforcement learning over the last few years, particularly within Google. These include distributional methods, which allow agents to model full distributions of returns rather than just their expected values and so learn a more complete picture of their world; large-scale distributed training, which uses asynchronous gradient descent to optimize deep neural network controllers; and the replay memories DeepMind introduced in DQN, which let agents leverage previous experience. The algorithms behind these advances are also applicable in other domains, such as robotics.
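
Of those ingredients, the replay memory is the easiest to sketch: transitions are stored as the agent acts and sampled at random later, de-correlating the training data and letting past experience be reused many times. The class below is a bare-bones illustration, not Dopamine’s implementation.

```python
# Bare-bones replay memory sketch (illustrative only).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))              # columns: obs, actions, rewards, ...

memory = ReplayMemory()
for step in range(1000):                        # fake interaction loop
    memory.add(obs=step, action=step % 4, reward=0.0, next_obs=step + 1, done=False)
print(len(memory.buffer), len(memory.sample(32)[0]))
```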

Part of Bellemare and Castro’s intent with the new Dopamine framework was to disrupt existing ways of developing RL systems, which frequently require “quickly iterating over a design – often with no clear direction”. In the recent blog post on the new RL framework, the Google Brain researchers point out the ways in which flaws in this kind of methodology can actually slow down the development process and limit exploration. Their new package aims instead to offer ease of use, tools that enable reproducibility, and a host of training data to allow new researchers to quickly test out their ideas against existing methods.

As stated above, the Google Brain team constructed the new Dopamine reinforcement learning framework with three specific goals in mind: flexibility, stability and reproducibility, in order to enable researchers “to iterate on RL methods effectively, and thus explore new research directions that may not have immediately obvious benefits”. Reproducing the results from existing frameworks often takes a great deal of time, which can lead to scientific reproducibility issues further down the line.

With those goals in mind, Dopamine comprises a concise set of well-documented code (around 15 Python files) focused on the Arcade Learning Environment (a platform for evaluating AI technology using Atari video games) and four different agents: DeepMind’s DQN; C51; a simplified version of the Rainbow agent; and the Implicit Quantile Network. To support reproducibility, the code is offered with full test coverage and training data (in JSON and Python pickle formats) for all 60 games that the Arcade Learning Environment supports, so that researchers can benchmark their ideas against established methods.
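
The bundled baselines are intended to make comparisons against established agents straightforward. The sketch below illustrates that workflow; the JSON schema and numbers are assumptions standing in for Dopamine’s real files, whose actual format and locations are documented in the repository.

```python
# Hedged sketch of benchmarking a new agent against bundled baseline data.
# The JSON structure here is an assumption for illustration only.
import json
import matplotlib.pyplot as plt

# Pretend this string was read from one of the provided baseline JSON files.
fake_baseline_json = json.dumps({"iterations": list(range(10)),
                                 "average_return": [1, 3, 5, 8, 10, 12, 13, 15, 15, 16]})
baseline = json.loads(fake_baseline_json)

# Returns logged from a hypothetical new agent being compared against the baseline.
my_returns = [0, 2, 4, 6, 9, 12, 14, 16, 17, 18]

plt.plot(baseline["iterations"], baseline["average_return"], label="bundled DQN baseline")
plt.plot(baseline["iterations"], my_returns, label="new agent")
plt.xlabel("training iteration")
plt.ylabel("average episode return")
plt.legend()
plt.show()
```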

Google has also launched a website for developers to visualize training runs for the four agents on all 60 games. It is also making available trained models, raw statistics logs, and TensorFlow event files that can be used for plotting with TensorBoard, Google’s suite of visualization tools for TensorFlow programs.

“Our hope is that our framework’s flexibility and ease-of-use will empower researchers to try out new ideas, both incremental and radical,” Bellemare and Castro wrote. “We are already actively using it for our research and finding it is giving us the flexibility to iterate quickly over many ideas. We’re excited to see what the larger community can make of it.”
