Friday, August 25, 2017

Reinforcement Learning Overview

There are basically three types of machine learning:
  • Supervised Learning: The major use case is prediction. We provide a set of training data that includes both input and output, then train a model that can predict the output for an unseen input.
  • Unsupervised Learning: The major use case is pattern extraction. We provide a set of data that has no output, and the algorithm tries to extract the underlying non-trivial structure within the data.
  • Reinforcement Learning: The major use case is optimization. Mimicking how humans learn from childhood, we use a trial-and-error approach to find out which actions produce good outcomes, and bias our preference towards those actions.
In this post, I will provide an overview of the settings of Reinforcement Learning as well as some of its key algorithms.

Agent / Environment Interaction

Reinforcement Learning is all about how we can make good decisions through trial and error. It is framed as the interaction between the "agent" and the "environment".

Repeat the following steps until reaching a termination condition:
  1. The agent observes that the environment is in state s
  2. Out of all possible actions, the agent needs to decide which action to take. (The decision rule is called the "policy", which is a function that outputs an action given the current state)
  3. The agent takes action a, and the environment receives that action
  4. Through a transition model (a matrix of transition probabilities), the environment determines the next state and proceeds to that state
  5. Through a reward distribution model, the environment determines the reward given to the agent for taking action a in state s

The goal for the agent is to determine an optimal policy such that the "value" of the start state is maximized.

Some terminology
  • Episode:  a sequence of (s1, a1, r1, s2, a2, r2, s3, a3, r3 .... st, at, rt ... sT, aT, rT)
  • Reward rt: The payoff (think of it as money) the agent receives after taking action at in state st at time t
  • Return: Cumulative reward from the time the action is taken until the end of the episode (sum of rt, r[t+1], ... rT)
  • Value: Expected return from a particular state, called the "state value" V(s), or expected return when taking action a in state s, called the "Q value" Q(s,a)
The optimal policy can be formulated as choosing, among all choices of a at state s, the action a* such that Q(s, a*) is maximum.

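To deal with never-ending interaction, we put a discount factor "gamma" (between 0 and 1) on future rewards, so that a reward received k steps later is weighted by gamma^k. As long as gamma is less than 1 and the rewards are bounded, this turns the sum of an infinite series into a finite number.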
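To make the effect of the discount factor concrete, here is a minimal sketch of computing a discounted return from a list of rewards. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# With a constant reward of 1 per step and gamma = 0.9, the return of a very
# long episode approaches the geometric-series limit 1 / (1 - 0.9) = 10.
print(discounted_return([1.0] * 1000, 0.9))
```

The backward accumulation avoids computing powers of gamma explicitly, and it shows why the sum stays finite: each additional future step contributes a geometrically smaller amount.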

Optimal Policy when model is known

If we know the "model" (the transition probabilities and the reward distribution), then figuring out the policy is easy. We just need to use dynamic programming to compute the optimal policy offline, and there is no need for learning.

Two algorithms can be used:

"Value iteration" starts with arbitrary (for example random) values and iteratively updates them based on the Bellman equation until they converge, which yields the "value" of each state or state/action pair (also called Q state). The optimal policy for a given state s is then to choose the action a* that maximizes the Q value, Q(s, a).

Another algorithm, "policy iteration", starts with a random policy and iteratively modifies the policy to make it better, until the policy no longer changes from one iteration to the next.
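As a rough illustration of value iteration, the sketch below runs it on a made-up two-state problem. The dictionary layout `P[s][a] = [(probability, next_state, reward), ...]`, the helper `q_of`, and the toy numbers are all assumptions chosen for the example. For simplicity the values start at zero rather than at random values.

```python
def q_of(P, V, s, a, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}                 # initial guess (zeros here)
    while True:
        delta = 0.0
        for s in P:
            # Bellman update: value of s is the best Q value over its actions
            best = max(q_of(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                      # stop once the values settle
            break
    # Greedy policy: pick the action with the highest Q value in each state
    policy = {s: max(P[s], key=lambda a: q_of(P, V, s, a, gamma)) for s in P}
    return V, policy

# Hypothetical toy model: staying in B pays 2 per step, moving A -> B pays 1.
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V, policy = value_iteration(toy)
print(V, policy)
```

Policy iteration would use the same model but alternate between evaluating a fixed policy and improving it greedily.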

However, in practice, we usually don't know the model, so we cannot compute the optimal policy as described above.

Optimal Policy when model is unknown

One solution is "model based" learning: we spend some time finding out the transition probability model as well as the reward distribution model. To make sure we experience all possible combinations of state/action pairs, we take random actions while learning the model.

Once we learn the model, we can go back to use the value iteration or policy iteration to determine the optimal policy.

Learning has a cost though. Rather than taking the best known action, we take a random action in order to explore actions that we haven't tried before, and it is very likely that the associated reward is not the maximum. However, we accumulate knowledge about how the environment reacts under a wider range of scenarios, and hopefully this helps us choose better actions in the future. In other words, we trade off short-term gain for long-term gain.

Striking the right balance is important. A common approach is the epsilon-greedy algorithm. At each decision step, we take a random action with a small probability e, and with probability (1-e) we take the best action known so far.
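A minimal sketch of the epsilon-greedy choice, assuming the current value estimates are kept in a dictionary from action to estimated value (the function name and the sample numbers are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to its current estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore: any action at random
    return max(q_values, key=q_values.get)       # exploit: best known action

estimates = {"left": 0.1, "right": 0.7, "stay": 0.3}
print(epsilon_greedy(estimates, epsilon=0.1))
```

In practice epsilon is often reduced over time, so the agent explores heavily at first and exploits more as its estimates improve.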

Another approach is "model free" learning. Let's go back and look at the detailed formulas behind value iteration and policy iteration: the reason for knowing the model is to calculate the expected state value and Q value. Can we directly estimate the expected state value and Q value through trial and error?

Value based model free learning

If we modify the Q value iteration algorithm to replace the expected reward/nextstate with the actual reward/nextstate, we arrive at the SARSA algorithm below.
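As a hedged sketch of the tabular SARSA update, the code below moves Q(s, a) a small step towards the target r + gamma * Q(s', a'), using the reward and next state/action that were actually observed. The learning rate `alpha`, the `done` flag and the state/action labels are assumptions introduced for the example.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """Update Q[(s, a)] in place from one observed (s, a, r, s', a') transition."""
    # At the end of an episode there is no next state, so the target is just r
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)          # unseen state/action pairs start at 0
sarsa_update(Q, s="s1", a="right", r=1.0, s_next="s2", a_next="right")
print(Q[("s1", "right")])
```

The next action a' would typically be chosen with the same exploration rule (for example epsilon-greedy) that the agent uses to act.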

Deep Q Learning 

The algorithm above requires us to keep a table of all Q(s,a) values, which can be huge, and becomes infinite if any of the states or actions is continuous. To deal with this, we introduce the idea of a parameterized value function. The state and action become the input parameters of this function, which creates "input features", feeds them into a linear model, and finally outputs the Q value.

Now we modify the previous SARSA algorithm to the following ...

  • Instead of looking up the Q(s,a) value, we pass the feature f(s, a) to the function (which can be a DNN) and take its output
  • We randomly initialize the parameters of the function (the weights, if the function is a DNN)
  • We update the parameters using gradient descent on the loss, which can be, for example, the squared difference between the estimated value and the target value (the target can be a one-step look-ahead estimate: r + gamma*max_a'[Q(s',a')])
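The three changes above can be sketched with a linear model standing in for the DNN. This is only an illustration: the feature vectors, the learning rate `alpha`, and the function names are assumptions, and the target is treated as a constant when taking the gradient (a common simplification).

```python
def q_value(w, features):
    """Linear Q estimate: dot product of weights and the f(s, a) features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def td_update(w, feats_sa, reward, next_feats_per_action, alpha=0.01, gamma=0.9):
    """One gradient step on 0.5 * (target - Q(s, a))**2 with respect to w."""
    # One-step look-ahead target: r + gamma * max over a' of Q(s', a')
    target = reward + gamma * max(q_value(w, f) for f in next_feats_per_action)
    error = target - q_value(w, feats_sa)
    # For a linear model the gradient of Q with respect to w is the feature vector
    return [wi + alpha * error * xi for wi, xi in zip(w, feats_sa)]

w = [0.0, 0.0]
w = td_update(w, feats_sa=[1.0, 2.0], reward=1.0,
              next_feats_per_action=[[1.0, 0.0], [0.0, 1.0]], alpha=0.1)
print(w)
```

With a deep network the structure is the same, except that the gradient is obtained by backpropagation instead of being the feature vector itself.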

If we further generalize the Q value function using a deep neural network and update the parameters using backpropagation, we reach a simple version of Deep Q Learning.

While this algorithm allows us to learn a Q value function that can represent a continuous state, we still need to evaluate every action and pick the one with the maximum Q value. In other words, the action space can only be discrete and finite.

Policy gradient

Since the end goal is to pick the right action, and finding the Q value is just a means to that end (so we can pick the action with maximum Q), why don't we learn a function that takes a state and directly outputs an action? Using this policy function approach, we can handle both continuous and discrete action spaces.

The key idea is to learn a function (given a state, output an action)

  • If the action is discrete, it outputs a probability for each action
  • If the action is continuous, it outputs the mean and variance of the action, assuming a normal distribution
The agent samples from the output distribution to determine the action, so its chosen action is stochastic (nondeterministic). Then the environment determines the reward and next state, and the cycle repeats.

The goal is to find the policy function for which the expected value of Q(s, a) is maximized. Notice that s and a are random variables whose distribution is parameterized by θ, the parameters of the policy function.

To maximize an "expected value" whose distribution depends on the parameters θ, we need to calculate the gradient of that expected value with respect to θ.
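For reference, the gradient of such an expectation is usually obtained with the log-derivative identity. The notation below (J for the objective, π_θ for the policy function) is an assumption introduced here, and the last line is the form in which the policy gradient is commonly quoted rather than a full derivation, since Q itself also depends on the policy.

```latex
% Objective: expected Q value when states and actions are generated by the policy \pi_\theta
J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\left[ Q(s,a) \right]

% Log-derivative identity for a density p_\theta(x) and a function f(x) that does not depend on \theta
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]
  = \int f(x) \, \nabla_\theta p_\theta(x) \, dx
  = \int f(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx
  = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]

% Commonly quoted form of the policy gradient
\nabla_\theta J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\left[ Q(s,a) \, \nabla_\theta \log \pi_\theta(a \mid s) \right]
```

The useful property is that the right-hand side is itself an expectation, so it can be estimated by sampling states and actions while following the policy.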

Actor Critic Algorithm

There are 2 moving targets in this equation:

  • To improve the policy function, we need an accurate estimate of the Q value and also need to know the gradient of log π(a|s), the log-probability that the policy assigns to action a in state s
  • To make the Q value estimate more accurate, we need a stable policy function
We can break down these into two different roles
  • An actor, whose job is to improve the policy function by tuning the policy function parameters
  • A critic, whose job is to fine tune the estimation of Q value based on current (incrementally improving) policy
The "actor critic" algorithm is shown below.

Then we enhance this algorithm by adding the following steps
  • Replace the Q value function with an advantage function, where A(s, a) = Q(s, a) - Expected Q(s, *), i.e. A(s, a) = Q(s, a) - V(s)
  • Run multiple threads asynchronously
This is the state-of-the-art A3C (Asynchronous Advantage Actor-Critic) algorithm.

Learning resources and credits

Some of the algorithms I discussed above are drawn from the following sources

Saturday, July 15, 2017

Regression model outputting probability density distribution

For a classification problem (let's say the output is one of the labels R, G, B), how do we predict?

There are two formats in which we can report our prediction:
  1. Output a single value, which is the most probable outcome, e.g. output "B" if P(B) > P(R) and P(B) > P(G)
  2. Output the probability estimate of each label (e.g. R=0.2, G=0.3, B=0.5)
But if we look at a regression problem (let's say we output a numeric value v), most regression models only output a single value (the one that minimizes the RMSE). In this article, we will look at some use cases where outputting a probability density function is much preferred.

Predict the event occurrence time

As an illustrative example, we want to predict when a student will finish her work, given that she has already spent some time s. In other words, we want to estimate E[t | t > s], where t is a random variable representing the total duration and s is the elapsed time so far.

Estimating time t is generally hard if the model only outputs an expectation. Notice that the model has the same set of features, except that the elapsed time changes continuously as time passes.

Let's look at how we can train a prediction model that can output a density distribution.

Let's say our raw data has the schema [feature, duration]:
  • f1, 13.30
  • f2, 14.15
  • f3, 15.35
  • f4, 15.42
Take a look at the range (i.e. min and max) of the output value. We transform the raw data into training data with the following schema:
[feature, dur<13, dur<14, dur<15, dur<16]
  • f1, 0, 1, 1, 1
  • f2, 0, 0, 1, 1
  • f3, 0, 0, 0, 1
  • f4, 0, 0, 0, 1
After that, we train 4 classification models, one per threshold column.
  • feature, dur<13
  • feature, dur<14
  • feature, dur<15
  • feature, dur<16

Now, given a new observation with its corresponding features, we can invoke these 4 models to output the probability of each binary classification (which is a cumulative probability). If we want the probability density, we simply take the difference between consecutive cumulative probabilities (i.e. differentiate the cumulative probability).

At this point, we can output a probability distribution given the input features.
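The differencing step can be sketched as follows. The cumulative probabilities used here are made-up values standing in for the outputs of the 4 classifiers, and clipping them so they never decrease is an extra safeguard added in this sketch, since independently trained classifiers are not guaranteed to produce a monotone sequence.

```python
def cdf_to_bins(thresholds, cdf):
    """Turn P(dur < threshold) values into a probability for each interval."""
    bins = []
    prev_t, prev_p = thresholds[0], cdf[0]
    for t, p in zip(thresholds[1:], cdf[1:]):
        p = max(p, prev_p)                   # enforce a non-decreasing CDF
        bins.append(((prev_t, t), p - prev_p))
        prev_t, prev_p = t, p
    return bins

# Hypothetical classifier outputs for P(dur<13), P(dur<14), P(dur<15), P(dur<16)
print(cdf_to_bins([13, 14, 15, 16], [0.0, 0.25, 0.5, 1.0]))
```

Each entry pairs an interval with the probability mass that falls inside it. Dividing by the interval width would give a density in the strict sense.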

Now we can easily estimate the remaining time from the expected time within the shaded region, i.e. the part of the distribution beyond the elapsed time s. As time passes, we just need to slide the red line (the elapsed time s) continuously and recalculate the expected time; we don't need to re-run the prediction model unless the input features have changed.
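A minimal sketch of that recalculation, assuming the distribution is stored as (interval, probability) pairs and that the probability is spread uniformly inside each interval (both are assumptions of this example, as are the numbers):

```python
def expected_remaining(bins, s):
    """Estimate E[t | t > s] - s from a list of ((lo, hi), probability) bins."""
    mass = 0.0        # probability that t > s
    weighted = 0.0    # contribution of the part of each bin beyond s
    for (lo, hi), p in bins:
        if hi <= s or p == 0:
            continue                          # bin lies entirely before s
        lo_cut = max(lo, s)
        frac = (hi - lo_cut) / (hi - lo)      # share of the bin beyond s
        mass += p * frac
        weighted += p * frac * (lo_cut + hi) / 2
    if mass == 0:
        return None                           # no probability left beyond s
    return weighted / mass - s

bins = [((13, 14), 0.25), ((14, 15), 0.25), ((15, 16), 0.5)]
for elapsed in (13.0, 14.0, 15.5):
    print(elapsed, expected_remaining(bins, elapsed))
```

Only the elapsed time changes between calls. The bins come from a single run of the prediction models.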

Predict cancellation before commitment 

As an illustrative example, let's say a customer of a restaurant has reserved a table at 8:00pm. The time now is 7:55pm and the customer still hasn't arrived. What is the chance of a no-show?

Now, given a customer (with feature x) and a current time of S - t, where S is the reserved time and the customer still hasn't arrived, predict the probability that this customer will show up.

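Let's say our raw data has the schema [feature, arrival], where arrival is the arrival time relative to the reserved time (negative means early) and infinity means the customer never showed up: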
  • f1, -15.42
  • f2, -15.35
  • f3, -14.15
  • f4, -13.30
  • f5, infinity
  • f6, infinity
We transform this into training data with the following schema:
[feature, arr<-16, arr<-15, arr<-14, arr<-13]
  • f1, 0, 1, 1, 1
  • f2, 0, 1, 1, 1
  • f3, 0, 0, 1, 1
  • f4, 0, 0, 0, 1
  • f5, 0, 0, 0, 0
  • f6, 0, 0, 0, 0
After that, we train 4 classification models.
  • feature, arr<-16
  • feature, arr<-15
  • feature, arr<-14
  • feature, arr<-13
Notice that P(arr<0) can be smaller than 1 because the customer can be a no-show.

In this post, we discussed some use cases where we need the regression model to output not just a point prediction but also the probability density distribution, and we illustrated how to build such a prediction model.

Sunday, July 2, 2017

How AI differs from ML

AI is not a new term. It is multiple decades old, dating back to the 1950s, when computer scientists began designing algorithms that can "learn" and "mimic human behavior".

On the "learning" side, the most significant algorithm was the Neural Network, which was not very successful at the time due to overfitting (the model is too powerful for the amount of data available). Nevertheless, in some more specific tasks, the idea of "using data to fit a function" gained significant success, and this forms the foundation of "machine learning" today.

On the "mimic" side, people have focused on "image recognition", "speech recognition" and "natural language processing". Experts have spent a tremendous amount of time creating features like "edge detection", "color profile", "N-grams", "syntax tree", etc. Nevertheless, the success has been moderate.

Traditional Machine Learning

Machine Learning (ML) techniques have played a significant role in prediction, and ML has gone through multiple generations, with a rich set of model structures, such as

  • Linear regression
  • Logistic regression
  • Decision tree
  • Support Vector Machine
  • Bayesian model
  • Regularization model
  • Ensemble model
  • Neural network

Each of these predictive models is based on a certain algorithmic structure, with parameters as tunable knobs. Training a predictive model involves the following steps:

  1. Choose a model structure (e.g. Logistic regression, or Random forest, or ...)
  2. Feed the model with training data (with both input and output)
  3. The learning algorithm will output the optimal model (ie: model with specific parameters that minimize the training error)

Each model has its own characteristics and will perform well in some tasks and badly in others. But generally, we can group them into low-power (simple) models and high-power (complex) models. Choosing between different models is a very tricky question.

Traditionally, using a low-power / simple model has been preferred over a high-power / complex model for the following reasons:

  • Until we have massive processing power, training a high-power model takes too long
  • Until we have a massive amount of data, training a high-power model causes the overfitting problem (since a high-power model has many parameters and can fit a wide range of data shapes, we may end up training a model that fits the current training data too specifically and does not generalize well enough to make good predictions on future data).

However, choosing a low-power model suffers from the so-called "under-fit" problem, where the model structure is too simple to fit the training data when the underlying relationship is more complex. (Imagine the underlying data has a quadratic relationship: y = 5 * x^2. There is no way you can fit it with a linear regression y = a*x + b, no matter what a and b you pick.)

To mitigate the "under-fit" problem, data scientists typically apply their "domain knowledge" to come up with "input features" that have a more direct relationship with the output. (e.g. going back to the quadratic relationship y = 5 * x^2, if you create a feature z = x^2, then you can fit a linear regression y = a*z + b by picking a = 5 and b = 0.)
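The quadratic example can be checked with a small least-squares fit. This is only a sketch: `fit_line` is a hand-written helper for single-feature linear regression, and the sample x values are chosen arbitrarily.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single input feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [5 * x ** 2 for x in xs]

# Fitting on the raw input: the best line is flat and misses the curve
a_raw, b_raw = fit_line(xs, ys)

# Fitting on the engineered feature z = x^2: the line recovers y = 5*z + 0
zs = [x ** 2 for x in xs]
a_feat, b_feat = fit_line(zs, ys)

print(a_raw, b_raw, a_feat, b_feat)
```

The model structure is identical in both fits. Only the input feature changed, which is exactly the work that feature engineering moves onto the data scientist.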

The major obstacle of "Machine Learning" is this "Feature Engineering" step, which requires "domain experts" with deep knowledge to identify the important signals before feeding them into the training process. The feature engineering step is very manual and demands a lot of scarce domain expertise, and it has therefore become the major bottleneck of most machine learning tasks today.

In other words, if we don't have enough processing power and enough data, then we have to use a low-power / simpler model, which requires us to spend significant time and effort creating appropriate input features. This is where most data scientists spend their time today.

Return of Neural Network

In the early 2000s, machine processing power increased tremendously with the advancement of cloud computing and massively parallel processing infrastructure, together with the big data era in which massive amounts of fine-grained event data are being collected. We are no longer restricted to low-power / simple models. For example, two of the most popular mainstream machine learning models today are RandomForest and Gradient Boosting Tree. Nevertheless, although both of them are very powerful and provide non-linear model fitting to the training data, data scientists still need to carefully create features in order to achieve good performance.

At the same time, computer scientists have revisited the use of many-layer Neural Networks for these human-mimicking tasks. This gave new birth to the DNN (Deep Neural Network) and provided a significant breakthrough in image classification and speech recognition tasks. The major difference with a DNN is that you can feed the raw signals (e.g. the RGB pixel values) directly into it without creating any domain-specific input features. Through many layers of neurons (hence the name "deep" neural network), a DNN can "automatically" generate the appropriate features at each layer and finally provide a very good prediction. This significantly reduces the "feature engineering" effort, a major bottleneck for data scientists.

DNNs have also evolved into many different network topologies and techniques, so we have CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short Term Memory), GAN (Generative Adversarial Network), Transfer Learning, Attention Model, etc. The whole spectrum is called Deep Learning, which is catching the whole machine learning community’s attention today.

Reinforcement Learning

Another key component is about how to mimic the way a person (or animal) learns. Imagine the very natural animal behavior of the perceive/act/reward cycle. A person or animal first understands the environment by sensing what "state" he is in. Based on that, he picks an "action", which brings him to another "state". Then he receives a "reward". The cycle repeats until he dies. This way of learning (called "Reinforcement Learning") is quite different from the "curve fitting" approach of traditional supervised machine learning. In particular, learning in RL is very fast because every new piece of feedback (such as performing an action and receiving a reward) is used immediately to influence subsequent decisions. Reinforcement Learning has gained tremendous success in self-driving cars as well as AlphaGo (a Go-playing program).

Reinforcement Learning also provides a smooth integration between "Prediction" and "Optimization", because it maintains a belief about the current state and the possible transition probabilities when taking different actions, and then decides which action can lead to the best outcome.

AI = DL + RL

Compared to classical ML techniques, DL provides a more powerful prediction model that usually produces good prediction accuracy. Compared to classical optimization models using LP (linear programming), RL provides a much faster learning mechanism and is also more adaptive to changes in the environment.

Saturday, April 29, 2017

An output of a truly random process

Recently I had a discussion with my data science team about whether we can challenge the claim that a set of observations follows a random process. Basically, data science is all about learning the hidden patterns that affect the observations. If the observations follow a truly random process, then there is nothing we can learn. Let me walk through an example to illustrate.

Let's say someone claims that he is throwing a fair dice (with numbers 1 to 6) sequentially, i.e. the output of each throw is uniformly random, with an equal chance of getting any number from 1 to 6.

He then throws the dice 12 times and shows you the output sequence. From the output, can you judge whether this is really a sequence of throws of a fair dice? In other words, does the output really follow a random process as claimed?

Lets look at 3 situations
  • Situation 1 output is [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
  • Situation 2 output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  • Situation 3 output is [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
At first glance, the output of situation 1 looks like the result of a random process. Situation 2 definitely doesn't look like it. Situation 3 is harder to judge. If you look at the proportion of the output numbers, the frequency of each number in situation 3 exactly matches the uniform distribution of a fair dice. But if you look at the ordering of the numbers, situation 3 follows a well-defined order that doesn't seem to be random at all. Therefore, I don't think the output of situation 3 follows a random process.

However, this seems to be a very arbitrary choice. Why would I look at the number ordering at all? Should I look at more properties, such as ...
  • Whether the numbers in the even positions are even
  • The average gap between consecutive throws
  • Whether the number in the 3rd position is always smaller than the number in the 10th position
  • ...
As you can see, depending on my imagination, the list can go on and on. How can I tell whether situation 3 follows a random process or not?

Method 1: Randomization Test

This is based on the hypothesis testing methodology. We establish the null hypothesis H0 that situation 3 follows a random process.

First, I define an arbitrary list of statistics of my choice
  • statisticA = proportion of even numbers in even positions
  • statisticB = average gap between consecutive output numbers
  • statisticC = ...
Second, I run a simulation to generate 12 numbers based on a random process, and calculate the corresponding statistics defined above.

Third, I repeat the simulation N times, and output the mean and standard deviation of each statistic.

If statistic A, B or C of situation 3 is too far away from the simulated mean of that statistic, measured in number of standard deviations and judged by the corresponding p-value, then we reject the null hypothesis and conclude that situation 3 is not following a random process. Otherwise, we don't have enough evidence to reject the null hypothesis, so we cannot rule out that situation 3 follows a random process.
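The procedure can be sketched for statisticB as follows. Taking the gap as an absolute difference, the two-sided empirical p-value, the number of simulations and the seed are all choices made for this sketch rather than part of the method as stated.

```python
import random
import statistics

def mean_gap(seq):
    """statisticB: average absolute difference between consecutive outputs."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

def randomization_test(observed, statistic, n_sims=10000, seed=0):
    rng = random.Random(seed)
    # Simulate fair-dice sequences of the same length and compute the statistic
    sims = [statistic([rng.randint(1, 6) for _ in observed]) for _ in range(n_sims)]
    obs = statistic(observed)
    mu = statistics.mean(sims)
    sd = statistics.stdev(sims)
    # Empirical p-value: share of simulations at least as far from the mean
    extreme = sum(abs(x - mu) >= abs(obs - mu) for x in sims)
    return obs, mu, sd, extreme / n_sims

situation3 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
print(randomization_test(situation3, mean_gap))
```

Whether a given sequence is flagged depends on which statistic is plugged in, which is the arbitrariness discussed earlier. Testing several statistics at once also makes a false alarm more likely.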

Method 2: Predictability Test

This is based on the theory of predictive analytics.

First, I pick a particular machine learning algorithm, let's say time series forecasting using ARIMA.
Notice that I can also choose to use RandomForest and create some arbitrary input features (such as the previous output number, the maximum of the last 3 numbers, etc.)

Second, I train my selected predictive model on the output data of situation 3 (in this example, situation 3 has only 12 data points, but imagine we have many more than 12).

Third, I evaluate my model on the test set and see whether the prediction is much better than a random guess. For example, I can measure the lift of my model by comparing the RMSE (root mean square error) of my prediction with the standard deviation of the testing data. If the lift is insignificant, then I conclude that situation 3 results from a random process, because my predictive model doesn't learn any pattern.
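One possible way to turn this comparison into a number is sketched below. Defining lift as the ratio of the test set's standard deviation to the model's RMSE is an assumption of this sketch, and the sample data is made up.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def lift(actual, predicted):
    """Standard deviation of the test data divided by the model's RMSE."""
    return statistics.pstdev(actual) / rmse(actual, predicted)

test_actual = [1, 2, 3, 4, 5, 6]
# A model that always predicts the mean has learned nothing: its lift is about 1
baseline = [statistics.mean(test_actual)] * len(test_actual)
print(lift(test_actual, baseline))
```

A lift clearly above 1 would suggest the model has picked up some pattern. A perfect prediction has an RMSE of 0, so that case would need separate handling.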