
Expectation Maximization

Tutorial 1

EM Algorithm in Machine Learning

Latent Variables for Maximum Likelihood

  • how to estimate the joint probability distribution for a dataset
  • "Probability Density Estimation is the construction of an estimate based on observed data, involving the selection of a probability distribution function and the parameters of that function that best explains the joint probability of the observed data."


In [16]:
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import normal
In [36]:
np.random.seed(0) # seed number 0
sample = normal(size=2000)
In [37]:
plt.hist(sample,bins=100) # number of bins is important
Out[37]:
(array([ 2.,  0.,  0.,  2.,  3.,  0.,  2.,  3.,  2.,  1.,  2.,  1.,  4.,
         3.,  4.,  3.,  6.,  8.,  2., 10., 13., 11., 21., 16., 13., 21.,
        20., 20., 15., 22., 26., 25., 31., 37., 32., 43., 42., 36., 43.,
        33., 43., 47., 51., 51., 48., 39., 47., 51., 53., 67., 48., 47.,
        61., 43., 45., 55., 45., 47., 37., 31., 26., 38., 34., 35., 39.,
        29., 23., 19., 13., 18., 22., 22., 15., 15., 13., 10., 15., 11.,
         7.,  9.,  8.,  7.,  1.,  7.,  5.,  6.,  4.,  3.,  1.,  4.,  1.,
         2.,  2.,  1.,  0.,  0.,  0.,  0.,  0.,  1.]),
 array([-3.04614305e+00, -2.98397188e+00, -2.92180070e+00, -2.85962952e+00,
        -2.79745834e+00, -2.73528716e+00, -2.67311599e+00, -2.61094481e+00,
        -2.54877363e+00, -2.48660245e+00, -2.42443127e+00, -2.36226009e+00,
        -2.30008892e+00, -2.23791774e+00, -2.17574656e+00, -2.11357538e+00,
        -2.05140420e+00, -1.98923302e+00, -1.92706185e+00, -1.86489067e+00,
        -1.80271949e+00, -1.74054831e+00, -1.67837713e+00, -1.61620595e+00,
        -1.55403478e+00, -1.49186360e+00, -1.42969242e+00, -1.36752124e+00,
        -1.30535006e+00, -1.24317888e+00, -1.18100771e+00, -1.11883653e+00,
        -1.05666535e+00, -9.94494172e-01, -9.32322993e-01, -8.70151815e-01,
        -8.07980637e-01, -7.45809458e-01, -6.83638280e-01, -6.21467102e-01,
        -5.59295924e-01, -4.97124745e-01, -4.34953567e-01, -3.72782389e-01,
        -3.10611210e-01, -2.48440032e-01, -1.86268854e-01, -1.24097676e-01,
        -6.19264973e-02,  2.44680964e-04,  6.24158592e-02,  1.24587038e-01,
         1.86758216e-01,  2.48929394e-01,  3.11100572e-01,  3.73271751e-01,
         4.35442929e-01,  4.97614107e-01,  5.59785285e-01,  6.21956464e-01,
         6.84127642e-01,  7.46298820e-01,  8.08469999e-01,  8.70641177e-01,
         9.32812355e-01,  9.94983533e-01,  1.05715471e+00,  1.11932589e+00,
         1.18149707e+00,  1.24366825e+00,  1.30583942e+00,  1.36801060e+00,
         1.43018178e+00,  1.49235296e+00,  1.55452414e+00,  1.61669532e+00,
         1.67886649e+00,  1.74103767e+00,  1.80320885e+00,  1.86538003e+00,
         1.92755121e+00,  1.98972239e+00,  2.05189356e+00,  2.11406474e+00,
         2.17623592e+00,  2.23840710e+00,  2.30057828e+00,  2.36274946e+00,
         2.42492063e+00,  2.48709181e+00,  2.54926299e+00,  2.61143417e+00,
         2.67360535e+00,  2.73577653e+00,  2.79794770e+00,  2.86011888e+00,
         2.92229006e+00,  2.98446124e+00,  3.04663242e+00,  3.10880360e+00,
         3.17097477e+00]),
 <BarContainer object of 100 artists>)
[Plot: histogram of the 2,000-point standard normal sample, 100 bins.]

Questions: Which probability distribution function to choose? How to choose the parameters for that function?

Solution: "Maximum Likelihood Estimation Algorithm = the method of estimating the parameters of a probability distribution by maximizing the likelihood function in order to make the observed data most probable for the statistical model"
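
As a quick illustration, here is a minimal sketch of maximum likelihood estimation for the Gaussian sample generated above: for a normal distribution the maximum likelihood estimates have a closed form, namely the sample mean and the root of the mean squared deviation. (This block is an added illustration, not part of the original notebook cells.)

import numpy as np
from numpy.random import normal

np.random.seed(0)
sample = normal(size=2000)             # same standard normal sample as above

# Closed-form maximum likelihood estimates for a Gaussian:
# the sample mean and the root of the mean squared deviation
mu_hat = sample.mean()
sigma_hat = np.sqrt(np.mean((sample - mu_hat) ** 2))
print(mu_hat, sigma_hat)               # should be close to the true values 0 and 1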

Limitation: "Assumes the data is completely and fully observed and all variables relevant to the model are already present. But sometimes some variables remain hidden (Latent Variables) and cause inconsistencies."

Another Solution: "Expectation Maximization Algorithm = finding appropriate distribution parameters in the presence of Latent Variables."

EM Algorithm In Machine Learning

"EM Algorithm is used to find the local maximum likelihood parameters of a statistical model, in cases where Latent Variables are present or the dataset is incomplete."

The Procedure:

  1. Consider a set of starting parameters for the incomplete data
  2. Expectation Step (E-step): use the observed data to estimate (guess) the missing values in the data
  3. Maximization Step (M-step): generate "complete" data by filling in the estimated values and update the parameters
  4. Repeat steps 2 & 3 until Convergence is met, i.e. until the estimates stop changing by more than a small threshold between iterations (a small worked sketch of this loop follows the list)
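
A small worked sketch of these four steps, for the toy case of a single Gaussian whose dataset has some values missing completely at random (the data, tolerance and variable names here are my own illustration, not part of the tutorial):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)
data[rng.random(500) < 0.2] = np.nan   # hide about 20% of the values at random

obs = data[~np.isnan(data)]
n, n_mis = len(data), int(np.isnan(data).sum())

mu, var = 0.0, 1.0                     # step 1: starting parameters
for _ in range(200):
    # step 2 (E-step): expected sufficient statistics of the missing entries
    # under the current parameters: E[x] = mu, E[x^2] = mu^2 + var
    sum_x = obs.sum() + n_mis * mu
    sum_x2 = (obs ** 2).sum() + n_mis * (mu ** 2 + var)
    # step 3 (M-step): re-estimate mean and variance from the "completed" data
    mu_new = sum_x / n
    var_new = sum_x2 / n - mu_new ** 2
    # step 4: convergence check on the change in the parameters
    converged = abs(mu_new - mu) < 1e-8 and abs(var_new - var) < 1e-8
    mu, var = mu_new, var_new
    if converged:
        break

print(mu, np.sqrt(var))                # close to the true values 5 and 2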

How Does the EM Algorithm Work?

  • Use the observed data to estimate the missing data, then update the parameter values
  • A set of initial parameters is considered
  • A set of unobserved and incomplete data is given to the system
  • ...with the assumption that the observed data comes from a specific model
  • Next comes the Expectation step (E-step)
  • It uses the observed data to estimate the missing or incomplete values
  • Next, the Maximization step (M-step) completes the data generated in the E-step and updates the hypothesis
  • Then a Convergence check

Gaussian Mixture Model

"Gaussian Mixture Model (GMM) is a model that uses a combination of probability distributions and and require the estimation of mean and standard deviation parameters"

  • Maximum Likelihood Estimation is the most common method for estimating the parameters of a GMM

Case: data points are generated by 2 different processes, each with its own Gaussian probability distribution...but it is unclear which distribution a given data point belongs to


  • In the EM algorithm, the E-step estimates the expected value of each Latent Variable
  • The M-step then optimizes the parameters of the distributions using maximum likelihood
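
Before handing the problem to scikit-learn, here is a from-scratch sketch of EM for this two-component case: the E-step computes the responsibilities of each component for each point, and the M-step re-estimates the weights, means and standard deviations by weighted maximum likelihood. The starting guesses, tolerance and variable names below are illustrative assumptions, not the tutorial's code.

import numpy as np

rng = np.random.default_rng(1)
x = np.hstack((rng.normal(loc=20, scale=5, size=4000),
               rng.normal(loc=40, scale=5, size=8000)))

def gauss_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Rough starting parameters (step 1)
pi = np.array([0.5, 0.5])       # mixing weights
mu = np.array([15.0, 50.0])     # component means
sigma = np.array([10.0, 10.0])  # component standard deviations

prev_ll = -np.inf
for _ in range(500):
    # E-step: responsibility of each component for each data point
    dens = np.vstack([pi[k] * gauss_pdf(x, mu[k], sigma[k]) for k in range(2)]).T
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates of the parameters
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    # Convergence check on the log-likelihood
    ll = np.log(dens.sum(axis=1)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(pi)     # expected near 1/3 and 2/3
print(mu)     # expected near 20 and 40
print(sigma)  # expected near 5 and 5

scikit-learn's GaussianMixture, used next, runs the same kind of E/M loop internally, with extras such as full covariance matrices, multiple restarts and a tolerance-based stopping rule.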

A new program to run the described case...

In [83]:
# import needed libraries
from numpy import hstack
from numpy.random import normal
import matplotlib.pyplot as plt
In [84]:
# create 2 data point samples, different location and size
sample1 = normal(loc=20, scale=5, size=4000)
sample2 = normal(loc=40, scale=5, size=8000)
In [85]:
# combine 2 samples into 1 with hstack command
sample = hstack((sample1, sample2))
In [86]:
plt.hist(sample, bins=50, density=True) # density=True to change y-axis to probability density, not raw counts
Out[86]:
(array([7.03577536e-05, 1.40715507e-04, 7.03577536e-05, 0.00000000e+00,
        7.03577536e-05, 4.92504275e-04, 1.40715507e-03, 2.60323688e-03,
        3.16609891e-03, 5.41754703e-03, 7.24684862e-03, 1.13979561e-02,
        1.46344127e-02, 1.87855202e-02, 2.27255544e-02, 2.27959122e-02,
        2.71580929e-02, 2.53287913e-02, 2.46252138e-02, 2.50473603e-02,
        2.01926753e-02, 1.57601368e-02, 1.30865422e-02, 1.04129475e-02,
        1.00611588e-02, 1.06240208e-02, 1.45640550e-02, 1.68155031e-02,
        2.27255544e-02, 3.27867132e-02, 3.76413982e-02, 4.34107340e-02,
        4.78432724e-02, 5.38236815e-02, 5.30497462e-02, 5.29090307e-02,
        4.54511088e-02, 3.49678035e-02, 2.82838169e-02, 2.27255544e-02,
        1.58304946e-02, 1.16090293e-02, 6.96541761e-03, 3.37717217e-03,
        3.30681442e-03, 1.40715507e-03, 9.14650797e-04, 2.11073261e-04,
        2.11073261e-04, 1.40715507e-04]),
 array([-0.14568943,  1.03873346,  2.22315634,  3.40757922,  4.59200211,
         5.77642499,  6.96084787,  8.14527075,  9.32969364, 10.51411652,
        11.6985394 , 12.88296228, 14.06738517, 15.25180805, 16.43623093,
        17.62065382, 18.8050767 , 19.98949958, 21.17392246, 22.35834535,
        23.54276823, 24.72719111, 25.91161399, 27.09603688, 28.28045976,
        29.46488264, 30.64930553, 31.83372841, 33.01815129, 34.20257417,
        35.38699706, 36.57141994, 37.75584282, 38.9402657 , 40.12468859,
        41.30911147, 42.49353435, 43.67795724, 44.86238012, 46.046803  ,
        47.23122588, 48.41564877, 49.60007165, 50.78449453, 51.96891741,
        53.1533403 , 54.33776318, 55.52218606, 56.70660894, 57.89103183,
        59.07545471]),
 <BarContainer object of 50 artists>)
[Plot: density histogram of the combined sample (50 bins), showing two peaks near 20 and 40.]

Observations:

  • Expected distribution: a peak for Sample 1 at 20 and a peak for Sample 2 at 40
  • ...for points in the middle of the combined distribution, it is unclear which distribution they belong to

We will use the Gaussian Mixture Model to find out. ChatGPT will be used to explain code lines.

In [ ]:
! pip install scikit-learn
In [87]:
# import the gaussian mixture method of scikit-learn
from sklearn.mixture import GaussianMixture
In [88]:
# reshape the combined distribution from 1D array to 2D column vector array
# scikit-learn needs 2D data
sample = sample.reshape((len(sample),1)) # reshape to (n_samples, 1): one feature column
In [94]:
# Create Gaussian Mixture Model with 2 Gaussian components and Random initialization
model = GaussianMixture(n_components=2, init_params='random') # 2 components, means and covariances initialized randomly
In [95]:
# fit the GMM to the data (runs the EM iterations)
model.fit(sample)
Out[95]:
GaussianMixture(init_params='random', n_components=2)
In [96]:
# Predict the Latent Values from the combined Distribution
y1 = model.predict(sample)
In [97]:
# Check Latent Values for the first and last points
# Easiest way to identify which distribution the points belong to
print(y1[:80]) # check for first 80 points...tend to be with the left most distribution
print(y1[-80:]) # check for last 80 points...tend to be with the right most distribution
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1
 1 1 1 1 1 1]

Observations:
I needed ChatGPT to help me interpret the final output. This is what it said:

  • Zeros mark data points assigned to one cluster
  • Ones mark data points assigned to the other cluster

So the output suggests:

  • The first 80 points are all assigned to label 0, i.e. they firmly belong to one cluster
  • The last 80 points are mostly assigned to label 1, the other cluster, with a handful of exceptions
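
The labels 0 and 1 are arbitrary cluster indices, so which label corresponds to the process centred at 20 and which to the one centred at 40 has to be read off the fitted parameters. A minimal follow-up, using the model fitted above (means_ and weights_ are standard attributes of scikit-learn's GaussianMixture):

# Inspect the fitted mixture parameters of the model trained above
print(model.means_)    # one component mean should land near 20, the other near 40
print(model.weights_)  # mixing proportions, expected to be roughly 1/3 and 2/3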

Application of EM Algorithm

  • Machine Learning and Computer Vision
  • Natural Language Processing
  • Quantitative Genetics
  • Psychometrics

Advantages and Disadvantages

Advantages

  • It is guaranteed that the likelihood will increase with each iteration
  • During implementation, the E-step and M-step are very easy for many problems
  • The solution for the M-step often exists in closed form

Disadvantages

  • The EM algorithm has very slow convergence
  • It converges only to a local optimum
  • It requires both forward and backward probabilities...a setback for some machine learning algorithms

Increasing Iterations

I asked ChatGPT how to increase the number of iterations for the GMM. It returned: GaussianMixture runs at most 100 EM iterations by default.

To increase the number of iterations with the scikit-learn implementation, an extra parameter (max_iter) must be passed.

I increase the maximum number of iterations to 500.
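
One caveat before comparing runs: with init_params='random' and no random_state, every call to fit starts from a different random initialization, so differences between runs can come from the initialization as much as from max_iter. A sketch of a more controlled comparison (random_state, n_init and max_iter are standard GaussianMixture parameters):

from sklearn.mixture import GaussianMixture

# Fix the seed and keep the best of several restarts, so that changing
# max_iter is the only thing that differs between experiments
model = GaussianMixture(n_components=2, init_params='random',
                        max_iter=500, n_init=5, random_state=0)
model.fit(sample)

The cells below keep the tutorial's original settings.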

In [98]:
# reshape the combined distribution from 1D array to 2D column vector array
# scikit-learn needs 2D data
sample = sample.reshape((len(sample),1)) # reshape to (n_samples, 1): one feature column
In [99]:
# Create Gaussian Mixture Model with 2 Gaussian components and Random initialization
model = GaussianMixture(n_components=2, init_params='random', max_iter=500) # 2 components, random initial means and covariances, up to 500 EM iterations
In [100]:
# fit the GMM to the data (runs the EM iterations)
model.fit(sample)
Out[100]:
GaussianMixture(init_params='random', max_iter=500, n_components=2)
In [101]:
# Predict the Latent Values from the combined Distribution
y1 = model.predict(sample)
In [102]:
# Check Latent Values for the first and last points
# Easiest way to identify which distribution the points belong to
print(y1[:80]) # check for first 80 points...tend to be with the left most distribution
print(y1[-80:]) # check for last 80 points...tend to be with the right most distribution
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]
[1 0 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
 1 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 0 0 0 1 1 1 1
 0 1 1 0 1 1]

The assignments changed noticeably from the 100-iteration run: the first 80 points are still all labelled 0, but the last 80 points now show more mixing between the two labels than before. As noted above, each fit also starts from a different random initialization, so the change cannot be attributed to the larger iteration budget alone.

Let's see if a run with max_iter=300 makes any difference.

In [78]:
# reshape the combined distribution from 1D array to 2D column vector array
# scikit-learn needs 2D data
sample = sample.reshape((len(sample),1)) # reshape to (n_samples, 1): one feature column
In [79]:
# Create Gaussian Mixture Model with 2 Gaussian components and Random initialization
model = GaussianMixture(n_components=2, init_params='random', max_iter=300) # 2 components, random initial means and covariances, up to 300 EM iterations
In [80]:
# fit the GMM to the data (runs the EM iterations)
model.fit(sample)
Out[80]:
GaussianMixture(init_params='random', max_iter=300, n_components=2)
In [81]:
# Predict the Latent Values from the combined Distribution
y1 = model.predict(sample)
In [82]:
# Check Latent Values for the first and last points
# Easiest way to identify which distribution the points belong to
print(y1[:80]) # check for first 80 points...tend to be with the left most distribution
print(y1[-80:]) # check for last 80 points...tend to be with the right most distribution
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0
 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 0 0 1 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1]