FAB Futures - Data Science

Research > Probability & Python¶

Deconstructing Neil's Programs¶

In [2]:
print("hello")
hello
In [ ]:
! pip install numpy
In [ ]:
! pip install matplotlib
In [ ]:
# Neil's Histogram Code

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)
npts = 1000
mean = 1
stddev = 2
#
# generate Gaussian samples
#
x = np.random.normal(mean,stddev,npts)
#
# plot histogram and points
#
plt.hist(x,bins=npts//50,density=True)
plt.plot(x,0*x,'|',ms=npts/20)
#
# plot Gaussian
#
xi = np.linspace(mean-3*stddev,mean+3*stddev,100)
yi = np.exp(-(xi-mean)**2/(2*stddev**2))/np.sqrt(2*np.pi*stddev**2)
plt.plot(xi,yi,'r')
plt.show()
[Figure: histogram of the samples with point markers along the x-axis and the Gaussian curve in red]

Deconstructing Neil's Histogram Code¶

In [8]:
# Import Libraries  

import numpy as np
import matplotlib.pyplot as plt
In [9]:
# Specify a Random Seed Value

np.random.seed(10) #seed value 10
In [10]:
# Define Variables 

npts = 1000 # number of points
mean = 1 # mean value
stddev = 2 # standard deviation value, width
In [12]:
# Generate Gaussian samples
# Random points in a Bell-Shaped distribution

x = np.random.normal(mean,stddev,npts)
In [18]:
# Plot Histogram

plt.hist(x,bins=npts//50,density=True)
Out[18]:
(array([0.00339887, 0.00339887, 0.0050983 , 0.00849717, 0.03568812,
        0.05268246, 0.08327227, 0.12405869, 0.14615133, 0.19883379,
        0.22092644, 0.22092644, 0.19203606, 0.13255586, 0.11726095,
        0.0747751 , 0.02549151, 0.02209264, 0.02039321, 0.01189604]),
 array([-5.40880269, -4.82037152, -4.23194036, -3.64350919, -3.05507803,
        -2.46664686, -1.8782157 , -1.28978453, -0.70135337, -0.1129222 ,
         0.47550896,  1.06394013,  1.65237129,  2.24080246,  2.82923362,
         3.41766479,  4.00609596,  4.59452712,  5.18295829,  5.77138945,
         6.35982062]),
 <BarContainer object of 20 artists>)
[Figure: histogram of the samples]
In [14]:
# Plot Points

plt.plot(x,0*x,'|',ms=npts/20)
Out[14]:
[<matplotlib.lines.Line2D at 0x19d89cfbd90>]
[Figure: sample point locations plotted along the x-axis]
In [ ]:
# Plot Gaussian Fit Function Line

# Two Calculations > X and Y points
xi = np.linspace(mean-3*stddev,mean+3*stddev,100)
yi = np.exp(-(xi-mean)**2/(2*stddev**2))/np.sqrt(2*np.pi*stddev**2)

# Plotting the Results
plt.plot(xi,yi,'r')
plt.show()
[Figure: Gaussian fit function curve in red]

OK! Not too bad!

  1. Import Numpy and Matplotlib
  2. Specify Random Seed Value and define 3 Variables
  3. Use the 'normal' command to generate a Gaussian distribution of points...plot it as a Histogram
  4. Plot the point locations (x-axis)
  5. Calculate and plot the Gaussian Fit Function line

So to plot my own Histogram, I would just need to replace the data used in step 3 with my own data. Now to find some bell-shaped data...
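In the meantime, here's a minimal sketch of what that swap would look like — `my_data` below is a made-up, hypothetical sample (heights in cm) standing in for real data, and the bin count is just an illustrative choice:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical bell-shaped data (heights in cm) -- replace with real measurements
my_data = [158, 162, 165, 167, 168, 170, 170, 171, 172, 173,
           174, 175, 175, 176, 178, 180, 182, 185, 190]

# same recipe as Neil's step 3, but with my_data instead of np.random.normal(...)
plt.hist(my_data, bins=8, density=True, edgecolor='black')
plt.plot(my_data, [0] * len(my_data), '|', ms=20)
plt.show()

print(np.mean(my_data), np.std(my_data))
```

The mean and standard deviation printed at the end could then feed the same `xi`/`yi` Gaussian-fit calculation from step 5.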

Python Probability Tutorials¶

Tutorial 1¶

Matplotlib Tutorial: Histograms

Histograms visualize how data falls within boundaries by grouping the values into bins and counting the frequency in each bin.

In [ ]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

plt.style.use('fivethirtyeight')

# data 
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45,55]
mean_age = np.mean(ages) # calculate mean

# histogram
bins = [10,20,30,40,50,60]
plt.hist(ages, bins = bins, edgecolor = 'black')
plt.axvline(mean_age, color='red', label='Mean') # plot mean as vertical red line

plt.title('test')
plt.xlabel('x')
plt.ylabel('y')

plt.tight_layout()

plt.show()
print(mean_age)
[Figure: histogram of ages with the mean marked as a vertical red line]

Tutorial 2¶

The 6 Must-Know Statistical Distributions Made Easy

Normal Distribution
Continuous numerical data values whose frequencies are symmetrically grouped around the Mean value. The spread of values to either side of the mean is represented by the Standard Deviation.

T Distribution
Similar to the Normal Distribution, but designed to work with samples of data rather than the full population, particularly when the sample size is small. The T-distribution accounts for the uncertainty due to small sample size via its Degrees of Freedom parameter.

Binomial Distribution
Models trials with 2 (binary) outcomes: the probability of each possible count of successes when 2 choices are available.

Bernoulli Distribution
A special case of the Binomial Distribution with a single trial, so there are only 2 possible outcomes: true or false.

Uniform Distribution
All outcomes are equally likely to occur. Flat or uniform distribution among outcomes.

Poisson Distribution
Non-symmetrical, bounded between zero and infinity. Based on expected number of events per time unit.
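The six distributions above can all be sampled with NumPy's random generator — a quick sketch, where the parameters (means, trial counts, rates) are illustrative choices of mine, not from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 10_000

normal    = rng.normal(loc=1, scale=2, size=n)   # symmetric around the mean
t_dist    = rng.standard_t(df=5, size=n)         # like normal, heavier tails (small samples)
binomial  = rng.binomial(n=10, p=0.5, size=n)    # successes out of 10 binary trials
bernoulli = rng.binomial(n=1, p=0.5, size=n)     # single true/false trial
uniform   = rng.uniform(low=0, high=1, size=n)   # all outcomes equally likely
poisson   = rng.poisson(lam=3, size=n)           # event counts per time unit, >= 0

print(normal.mean(), poisson.mean())
```

Plotting each array with `plt.hist(..., density=True)` shows the characteristic shape of each distribution.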

Tutorial 3¶

Mutual Information

Mutual Information, Clearly Explained

Notes:

  • Variables influencing predictions are called 'Features'
    • Note: a variable without variance is not useful as a feature
  • Want to simplify data collection by removing unnecessary variables (those with little influence)
  • R-squared (Regression) could be used to determine the relationship between variables and predictions...but R-squared only works with Continuous Variables (not Discrete Variables like booleans, etc.)
  • Mutual Information...used to discover relationship between a Discrete Variable and Predictions
  • Summation over all outcomes of the Joint Probability times the log of the Joint Probability divided by the product of the Marginal Probabilities
  • Joint Probabilities are the probability of 2 things occurring simultaneously
  • Marginal Probabilities are the probability of 1 thing occurring (typically recorded at the outer most row and column or 'margin' of the table)
  • equation uses Natural Log as a convention
  • A larger calculated Mutual Information value indicates greater influence of the variable on the prediction
  • A Histogram must be created from a Continuous Variable (turning it into Discrete Values)...before it can be used in a Mutual Information calculation
  • Mutual Information is similar to Entropy...both are sums of probabilities times logs
  • MI can be derived from Entropy