
Probability

This notebook is in three parts:

  1. Investigating the probability distribution
  2. Fitting a probability distribution
  3. Density estimation

Investigating the probability distribution of my data

Again, I started the assignment by asking Gemini the following prompts so I could get a better grasp of what I'm expected to do:

  1. With the goal of "quantifying uncertainty", what does it mean to investigate the probability distribution of my data?
  2. What are the most appropriate functions and/or representations for investigating (not fitting) the probability distribution of my data? Please provide explanations for why.

Response 1:

To "investigate the probability distribution" means to stop looking at your data as single, absolute numbers (e.g., "50 deaths") and start looking at them as ranges of likelihood (e.g., "Usually around 50, but sometimes as low as 40 or as high as 60").

In the context of "Quantifying Uncertainty," it helps you move from saying "This will happen" to "This is how sure we are that this will happen."

When you fit a line to your data, you are finding the Average Trend. But real life doesn't follow the average perfectly. It scatters around the line.

  • The Point Estimate: "I predict 55 cases." (This is your Trend Line).
  • The Probability Distribution: "How much did we miss by in the past?"
      • Did we usually miss by 1 or 2? (Tight distribution = High Certainty).
      • Did we sometimes miss by 50? (Wide distribution = High Uncertainty).
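To make the "tight vs. wide" idea concrete, here is a minimal sketch (with made-up numbers, not my actual data) of turning a point estimate into a range using the spread of past errors:

import numpy as np

# Toy example: how much past predictions missed by (residuals)
past_errors = np.array([-2, 1, 0, -1, 2])

point_estimate = 55                   # "I predict 55 cases"
spread = np.std(past_errors, ddof=1)  # typical size of a miss

# A rough ~95% range: estimate +/- 2 standard deviations
# (this rule of thumb assumes roughly bell-shaped errors)
low, high = point_estimate - 2*spread, point_estimate + 2*spread
print(f"Prediction: {point_estimate}, likely range: {low:.1f} to {high:.1f}")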

Response 2:

To "investigate" the probability distribution means to perform Exploratory Data Analysis (EDA). You are acting like a detective looking for clues about the "personality" of your data before you try to force it into a mathematical box.

Since you are analyzing trends (Time vs. Deaths), you cannot just investigate the raw numbers (because 2022 is naturally higher than 2018). Instead, you must investigate the Residuals (the errors).
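To convince myself of this point, here is a quick toy sketch (my own, not from Gemini): the spread of the raw values is dominated by the trend, while the spread of the residuals reflects only the actual noise.

import numpy as np

# Made-up trending series: deaths rise over time, plus a little noise
years = np.array([2018, 2019, 2020, 2021, 2022])
deaths = np.array([40, 45, 52, 55, 61])

print(f"SD of raw values: {np.std(deaths, ddof=1):.1f}")  # mostly just the trend

# Detrend: fit a line, subtract it, and look at the leftovers
trend = np.poly1d(np.polyfit(years, deaths, 1))(years)
residuals = deaths - trend
print(f"SD of residuals:  {np.std(residuals, ddof=1):.1f}")  # the real noise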

Here was the most helpful part of the response:

For a very small dataset like yours ($N=5$), standard tools like Histograms are actually bad because they group data into "bins," hiding detail. Here are the most appropriate representations for small data:

  1. The Residual Scatter Plot (Residuals vs. Time). Why it is appropriate: It checks for patterns. If your dots form a U-shape, your straight line model is wrong. If the dots fan out (getting wider), your uncertainty is increasing over time.
  2. The Strip Plot (or Rug Plot). Why it is appropriate: With only 5 data points, a histogram is blocky and misleading. A strip plot shows you the exact position of every error, revealing if they are clumped together or spread out.
  3. The Box Plot. Why it is appropriate: It instantly flags Outliers. If one dot is far outside the box, you know that year was statistically abnormal.

The Most Appropriate Functions (Math)

Since you are "investigating" and not yet "fitting," you use Descriptive Statistics (functions that describe what is, not what should be):

  • Mean of Residuals ($\mu$): Should be Zero. If not, the model is biased.
  • Standard Deviation ($\sigma$): Measures the "Spread." A high number means high uncertainty.
  • Skewness: Measures "Lopsidedness." 0 = Symmetrical (Normal); Positive = the model underestimates huge spikes (rare disasters). Why: vital for disease data, which often spikes high (outbreaks) but rarely drops below zero.
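Before applying any of this to my real data, here is a minimal self-contained sketch (toy residuals, not mine) of computing all three descriptive statistics; scipy.stats.skew is one common way to get skewness:

import numpy as np
from scipy.stats import skew

residuals = np.array([-3.0, -1.0, 0.5, 1.0, 2.5])  # toy errors

print(f"Mean:     {np.mean(residuals):.2f}")         # should be near zero
print(f"Std Dev:  {np.std(residuals, ddof=1):.2f}")  # the spread
print(f"Skewness: {skew(residuals):.2f}")            # 0 = symmetrical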

Starting the assignment:

In [1]:
import pandas as pd
import numpy as np
In [2]:
# As always, loading my file
df = pd.read_csv("datasets/Mortality cases3.csv")
df.head(5)
Out[2]:
  ICD10 CODE                  Name of the Disease Type of disease  2018  2019  2020  2021  2022
0       A02ᴳ                            Diarrhoea      Infectious   6.0   6.0   2.0   NaN   2.0
1       A03ᴳ                            Dysentery      Infectious   NaN   NaN   NaN   NaN   NaN
2       A15ᴳ                         Tuberculosis      Infectious  22.0  20.0  20.0  31.0  17.0
3       A41ᴳ  Other Sepsis, including Septicaemia      Infectious  62.0  46.0  52.0  32.0  45.0
4        A50                  Congenital Syphilis      Infectious   NaN   NaN   NaN   NaN   NaN
In [3]:
# Variable for grouping my data 
years = ['2018', '2019', '2020', '2021', '2022']

# Trimming whitespace:
df['Type of disease'] = df['Type of disease'].str.strip()

# Editing my data since some of the categories became confusing to work with:
df['Type of disease'] = df['Type of disease'].replace('Infectious', 'Infectious Diseases')

# New table! Grouping my data by the preferred category and summing the years for total deaths per category per year
df_grouped = df.groupby('Type of disease')[years].sum() # This is from Gemini

I asked Gemini the following prompt:

Given my data, please explain step-by-step how I can construct

  1. A Residual Scatter Plot
  2. A Strip Plot, and
  3. A Box Plot on my Jupyter notebook.

The following code was taken from Gemini, with some edits of my own.

Just to note: I have mainly stuck with plotly so far because the LLMs recommended it to start with, and when I tried graphing similar things using matplotlib, I found the plotly graphs more suited to my taste in terms of their interactivity. The interactivity doesn't always carry over to the live website, but in the notebook it provides some features that are genuinely helpful for me.

In [4]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def create_diagnostic_plots(category_name):
    # --- PART 1: PREPARE THE DATA & MATH ---
    # Safety Check
    if category_name not in df_grouped.index:
        print(f"Error: {category_name} not found.")
        return

    # Get X (Years) and Y (Actual Deaths)
    y_actual = df_grouped.loc[category_name].values
    x_years = np.array([2018, 2019, 2020, 2021, 2022])

    # Fit Linear Model (y = mx + c)
    coef = np.polyfit(x_years, y_actual, 1)
    poly_fn = np.poly1d(coef)

    # Calculate Residuals (The "Error")
    # Residual = Reality - Prediction
    y_predicted = poly_fn(x_years)
    residuals = y_actual - y_predicted

    # --- PART 2: CREATE THE CANVAS ---
    # We create a figure with 3 separate slots (subplots)
    fig = make_subplots(
        rows=1, cols=3,
        subplot_titles=("1. Residual Scatter Plot", "2. Strip Plot", "3. Box Plot"),
        column_widths=[0.5, 0.25, 0.25] # Make the first plot wider
    )

    # --- PART 3: PLOT 1 - RESIDUAL SCATTER PLOT ---
    # Goal: See if errors change over time (Time Pattern)
    fig.add_trace(go.Scatter(
        x=x_years, y=residuals,
        mode='markers', 
        marker=dict(size=10, color='blue'),
        name='Residual'
    ), row=1, col=1)

    # Add a red "Zero Line" (The target)
    fig.add_trace(go.Scatter(
        x=[2018, 2022], y=[0, 0],
        mode='lines',
        line=dict(color='red', dash='dash'),
        name='Zero Line'
    ), row=1, col=1)

    # --- PART 4: PLOT 2 - STRIP PLOT ---
    # Goal: See the raw spread of errors (Clustering)
    # We use a Box plot hack: make the box invisible so only dots show
    fig.add_trace(go.Box(
        y=residuals,
        boxpoints='all',      # Show all dots
        jitter=0,             # Don't scatter them left/right
        pointpos=0,           # Put them right in the center
        fillcolor='rgba(0,0,0,0)', # Invisible Box!
        line=dict(color='rgba(0,0,0,0)'), # Invisible Lines!
        marker=dict(color='green', size=10),
        name='Strip'
    ), row=1, col=2)

    # --- PART 5: PLOT 3 - BOX PLOT ---
    # Goal: See the summary stats (Median, Outliers)
    fig.add_trace(go.Box(
        y=residuals,
        marker=dict(color='orange'),
        name='Distribution'
    ), row=1, col=3)

    # --- PART 6: POLISH & SHOW ---
    fig.update_layout(
        title=f"Uncertainty Diagnostics: {category_name}",
        showlegend=False, # Hide legend to keep it clean
        template="plotly_white"
    )
    fig.show()

# Run it
create_diagnostic_plots('Infectious Diseases')
[Plot: "Uncertainty Diagnostics: Infectious Diseases" -- residual scatter plot, strip plot, and box plot]

Residual Scatter Plot (Left):

  • What it is: Plotting the error against the Year.
  • Why we need it: To check for Time Patterns.
  • Example: If the dots go Down-Down-Up-Up, your line is wrong (it should be a curve).
  • Ideal: Random bouncing around the red zero line.

Strip Plot (Middle):

  • What it is: All the errors plotted on a single vertical line.
  • Why we need it: To see the Density.
  • Example: Are the dots clumped together (predictable) or spread far apart (risky)?

Box Plot (Right):

  • What it is: A summary box showing the Median (middle line) and Range.
  • Why we need it: To find Outliers.
  • Example: If one dot is way outside the box whiskers, that year was a freak anomaly.

Now these results are obviously just for Infectious Diseases, and the process could be repeated to see the reliability of the data on each type of disease. However, with some help from Gemini ("Please help me interpret these plots"), I have noted my understanding of each one below. (Many of these are not so much understandings as they are further questions...)

  1. The residual scatter plot
  • If my model had been perfect at predicting deaths each year, the dots would all sit on the dashed red line -- here, however, we can see a mix of over- and underpredictions.
  • Possible implications: since it is a mix of over- and underpredictions, it could be positive in the sense that the model doesn't have a noticeable bias, but I think it mainly highlights a lack of sufficient data to draw any significant conclusions.
  2. The strip plot
  • Another way to look at the residuals, presented in a manner I personally find less appealing than the residual scatter plot, since residual values that are close together may be difficult to tell apart.
  • Possible implications: upon further reflection, my prior doubt is somewhat nullified, because an overall glance gives you a sense of the spread of the residuals plus any potential outliers.
  3. The box plot
  • Centre, spread, and skew: three words I haven't seen for a long time. My understanding is that the median being close to zero means the model is fairly reliable, despite the residuals. There is a slight negative skew, but it doesn't seem too significant. The fact that there are no significant outliers seems like a good start for the model, but again, the flaws in the data are a little too large to ignore.

To try my own hand at probability distributions without the LLM support, I used the following video as a reference to experiment with my data.

In [5]:
from IPython.display import YouTubeVideo
In [6]:
video_id = 'uial-2girHQ'
YouTubeVideo(id=video_id, width=900, height=400)
Out[6]:
[Embedded YouTube video]

After watching the whole video and trying out some of the suggested code, I still felt most comfortable with what I'd done previously with Gemini -- just keeping this one here as a note, especially since I don't normally use matplotlib and haven't yet used scipy.stats.

In [7]:
import matplotlib.pyplot as plt
import scipy.stats as stats
In [11]:
# df_grouped is already a DataFrame, so it can be plotted directly
# (this draws one density curve per year column)
df_grouped.plot(kind="density",
                figsize=(5,5),
                xlim=(-1,11));
[Plot: overlaid pandas density curves from df_grouped -- one curve per year column]

Fitting a probability distribution to my data

Having done a brief investigation of my data's probability distribution, I then asked Gemini the following question:

Now that I've investigated the probability distribution of my data, I would like to fit a probability distribution to my data.

  1. In simple terms, what does this mean?
  2. What would the best type of distribution be for my data, and why?

What I wanted to highlight from the Gemini response here was this:

Honorable Mention (For "Extra Credit"): Since you only have 5 data points, a Student’s t-distribution is theoretically more accurate. It looks like a Bell Curve but is "shorter and fatter," representing the fact that with only 5 years of history, extreme surprises are more likely. However, for most school projects, the Normal Distribution is the expected standard.

I had only ever worked with normal distributions before, so this was new to me -- and it seemed relevant given my 5 data points per disease.
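To see the "shorter and fatter" claim in actual numbers, here is a small check I added myself (not from Gemini): the probability of a surprise more than 3 standard deviations above the mean, under each distribution, with 4 degrees of freedom (5 points - 1):

from scipy.stats import norm, t

# Survival function sf(x) = P(X > x)
print(f"Normal:   {norm.sf(3):.4f}")     # about a 0.1% chance
print(f"t (df=4): {t.sf(3, df=4):.4f}")  # noticeably larger -- fatter tails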

I then put in this prompt:

Please

  1. Show me how to use my residuals to calculate the mean and standard deviation
  2. Using these data points, show me how to draw a t-distribution for my data

The following work is from that prompt.

In [17]:
# 1. Get your data
y_actual = df_grouped.loc['Infectious Diseases'].values
x_years = np.array([2018, 2019, 2020, 2021, 2022])

# 2. Calculate the Trend Line (Linear Model)
coef = np.polyfit(x_years, y_actual, 1)
predicted = np.poly1d(coef)(x_years)

# 3. Calculate Residuals (The Errors)
residuals = y_actual - predicted

# 4. Calculate the Stats
res_mean = np.mean(residuals)  # The Average Error
res_std = np.std(residuals, ddof=1) # The Spread (Standard Deviation)

print(f"Mean of Residuals: {res_mean:.2f}")
print(f"Standard Deviation: {res_std:.2f}")
Mean of Residuals: 0.00
Standard Deviation: 4.85

Note: I wanted to keep the above section of code just to understand how to calculate the mean and standard deviation of the residuals ... a mean of 0 seems pretty good here, but a standard deviation of 4.85 doesn't seem ideal given that my data figures are not all that high. Although trying to "predict" numbers of deaths from such a short time frame seems less than ideal anyway.
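One detail I wanted a note on for myself: the ddof=1 argument makes NumPy divide by N-1 instead of N (the "sample" standard deviation), which makes a visible difference with only 5 points. A quick check, reusing the residuals from the cell above:

print(np.std(residuals))          # divides by N   (population SD)
print(np.std(residuals, ddof=1))  # divides by N-1 (sample SD, used above)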

Also, with the code below, I managed to define and work on the plot_t_distribution function myself -- small wins!

In [26]:
import plotly.graph_objects as go
from scipy.stats import t
import numpy as np

# 1. DEFINING THE FUNCTION
def analyse_t_distribution(category_name):
    # Safety check, as always
    if category_name not in df_grouped.index:
        print("Error: Name not found")
        return

    # Getting my data
    y_actual = df_grouped.loc[category_name].values
    x_years = np.array([2018, 2019, 2020, 2021, 2022])

    # Fitting a Linear Model to get residuals
    coef = np.polyfit(x_years, y_actual, 1)
    predicted = np.poly1d(coef)(x_years)
    residuals = y_actual - predicted

    # Calculating the stats, as done 2 cells above
    res_mean = np.mean(residuals)
    res_std = np.std(residuals, ddof=1)
    
    print(f"--- Statistics for {category_name} ---")
    print(f"Mean: {res_mean:.2f}")
    print(f"Std Dev: {res_std:.2f}")

# 2. PLOTTING
def plot_t_distribution(category_name):
    if category_name not in df_grouped.index:
        print("Error: Name not found")
        return

    # Recomputing the residuals and stats for this category, so the plot
    # actually matches category_name instead of relying on leftover globals
    y_actual = df_grouped.loc[category_name].values
    x_years = np.array([2018, 2019, 2020, 2021, 2022])
    coef = np.polyfit(x_years, y_actual, 1)
    residuals = y_actual - np.poly1d(coef)(x_years)
    res_mean = np.mean(residuals)
    res_std = np.std(residuals, ddof=1)

    # Setting up the curve - drawing the curve from "Mean - 4 Sigmas" to "Mean + 4 Sigmas"
    x_curve = np.linspace(res_mean - 4*res_std, res_mean + 4*res_std, 100)
    
    # Calculating the Shape: df (Degrees of Freedom) = Number of Data Points - 1
    degrees_of_freedom = len(residuals) - 1 
    
    y_curve = t.pdf(x_curve, df=degrees_of_freedom, loc=res_mean, scale=res_std)
    
    # 3. Drawing the Plot
    fig = go.Figure()
    
    # A. The Histogram (My actual errors)
    fig.add_trace(go.Histogram(
        x=residuals, 
        histnorm='probability density', # Normalising height to match the curve
        name='Actual Residuals',
        marker_color='gray', opacity=0.5
    ))
    
    # B. The T-Distribution (The Math Model itself) -- still need to understand this!
    fig.add_trace(go.Scatter(
        x=x_curve, y=y_curve,
        mode='lines',
        name=f'T-Distribution (df={degrees_of_freedom})',
        line=dict(color='purple', width=3)
    ))
    
    # C. The Mean Line (Center)
    fig.add_trace(go.Scatter(
        x=[res_mean, res_mean], y=[0, max(y_curve)],
        mode='lines',
        line=dict(color='red', dash='dash'),
        name=f'Mean ({res_mean:.2f})'
    ))
    
    fig.update_layout(title=f"T-Distribution of Errors: {category_name}", template="plotly_white")
    fig.show()

# 3. CALL THE FUNCTIONS
analyse_t_distribution('Infectious Diseases')
plot_t_distribution('Infectious Diseases')
--- Statistics for Infectious Diseases ---
Mean: 0.00
Std Dev: 4.85
[Plot: "T-Distribution of Errors: Infectious Diseases" -- purple t-distribution curve over the gray residual histogram, with a red dashed mean line]

Density Estimation

Asking Gemini about this was tough, given that I don't even know how to talk about density estimation. My prompt was: "What is the difference between fitting a probability distribution to my data, and finding (?) the density estimation of my data?"

It provided a pretty long response, but I found this to be most helpful:

  • Fitting a Probability Distribution (Parametric): You assume the data has a specific, simple shape (like a Bell Curve), and you just calculate the size.
  • Density Estimation (Non-Parametric): You assume nothing. You let the data draw its own shape, no matter how weird or lumpy it looks.

I also noticed that, based on what I've read/seen so far, my density estimation is going to look even weirder given that I have only 5 data points per disease...
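Before the comparison code, here is my rough understanding of what gaussian_kde actually does, as a hand-rolled sketch on toy numbers (bandwidth picked arbitrarily): it puts a small Gaussian "bump" on every data point and averages them into one curve.

import numpy as np
from scipy.stats import norm, gaussian_kde

points = np.array([-4.0, -1.0, 0.0, 1.5, 3.5])  # toy residuals
x = np.linspace(-10, 10, 200)
bw = 1.5  # bandwidth = width of each bump (chosen by hand here)

# Hand-rolled KDE: the average of one Gaussian bump per data point
manual = np.mean([norm.pdf(x, loc=p, scale=bw) for p in points], axis=0)

# SciPy's version picks its own bandwidth, so the curves differ slightly
scipy_version = gaussian_kde(points)(x)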

In [27]:
import plotly.graph_objects as go
from scipy.stats import norm, gaussian_kde
import numpy as np

def compare_fit_vs_density(category_name):
    if category_name not in df_grouped.index:
        return

    # 1. GET RESIDUALS
    y_actual = df_grouped.loc[category_name].values
    x_years = np.array([2018, 2019, 2020, 2021, 2022])
    coef = np.polyfit(x_years, y_actual, 1)
    residuals = y_actual - np.poly1d(coef)(x_years)

    # 2. METHOD A: FITTING (Parametric)
    # Assuming it's a Bell Curve, calculating and using Mean/SD 
    mu, std = norm.fit(residuals)
    x_range = np.linspace(min(residuals)-10, max(residuals)+10, 100)
    y_fit = norm.pdf(x_range, mu, std)

    # 3. METHOD B: DENSITY ESTIMATION (Non-Parametric / KDE)
    # Just smooth out the data points into a curve
    kde = gaussian_kde(residuals)
    y_density = kde(x_range)

    # 4. PLOT
    fig = go.Figure()

    # The Histogram (The Raw Data)
    fig.add_trace(go.Histogram(
        x=residuals, histnorm='probability density',
        name='Data Histogram', marker_color='lightgray', opacity=0.5
    ))

    # The Fitted Line (Red)
    fig.add_trace(go.Scatter(
        x=x_range, y=y_fit, mode='lines',
        name='Fitted Distribution (Assumption)',
        line=dict(color='red', width=3, dash='dash')
    ))

    # The Density Line (Green)
    fig.add_trace(go.Scatter(
        x=x_range, y=y_density, mode='lines',
        name='Density Estimation (The Reality)',
        line=dict(color='green', width=3)
    ))

    fig.update_layout(title=f"Fitting vs. Density Estimation: {category_name}")
    fig.show()

compare_fit_vs_density('Infectious Diseases')
[Plot: "Fitting vs. Density Estimation: Infectious Diseases" -- red dashed fitted normal curve vs. green KDE curve over the residual histogram]

The produced plot made sense given the explanation of fitting vs. density estimation -- and the lumpier shape of the green density curve seems expected, given that only 5 data points were used to generate it...

Below were (and kind of still are) some remaining questions I have, which have been partly addressed by Gemini.

  1. What is gaussian KDE?

  2. Please provide further explanation for why it is better to plot the residuals, not the data points themselves, when looking at density estimation.

  3. Overall, all my probability calculations have looked at a single category_name at a time -- theoretically, how should my approach change if I wanted to compare across the types of diseases rather than focusing on just one?

I was most concerned about question 3, and am still trying to understand this. Gemini suggested using a violin plot, which is shown below, but I'm not convinced. Maybe I could categorise diseases into those with more or less uncertainty, or identify other ways to sort this data so that it tells a story... But for now, I'll leave this here (aside from one small stab at ranking diseases by uncertainty, after the residual table below).

Note:

I have started to break down the code below for my own understanding, but other than changing the height and a few small edits, I haven't fully understood it yet.

In [31]:
residual_data = []

for category in df_grouped.index:
    # 1. Get Data for this specific disease
    y_actual = df_grouped.loc[category].values
    x_years = np.array([2018, 2019, 2020, 2021, 2022])
    
    # 2. Fit Linear Model
    coef = np.polyfit(x_years, y_actual, 1)
    predicted_y = np.poly1d(coef)(x_years)
    
    # 3. Calculate Residuals
    residuals = y_actual - predicted_y
    
    # 4. Store them
    # We add 5 rows (one for each year) to our list
    for i in range(len(residuals)):
        residual_data.append({
            'Disease Type': category,
            'Year': x_years[i],
            'Residual': residuals[i]
        })

# Convert the list into a clean Pandas DataFrame
all_residuals_df = pd.DataFrame(residual_data)

# Show the first few rows to verify
print("Success! Calculated residuals for all diseases.")
print(all_residuals_df.head())
Success! Calculated residuals for all diseases.
                                        Disease Type  Year  Residual
0  Certain Conditions Originating in the Perinata...  2018       1.2
1  Certain Conditions Originating in the Perinata...  2019      -5.0
2  Certain Conditions Originating in the Perinata...  2020     -10.2
3  Certain Conditions Originating in the Perinata...  2021      30.6
4  Certain Conditions Originating in the Perinata...  2022     -16.6
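Before the violin plot, here is one small step toward my idea of sorting diseases by uncertainty (my own addition, not from Gemini): ranking each disease type by the standard deviation of its residuals, using the all_residuals_df table just built.

# Higher residual SD = the linear trend misses by more = more uncertainty
uncertainty_ranking = (
    all_residuals_df
    .groupby('Disease Type')['Residual']
    .std()
    .sort_values(ascending=False)
)
print(uncertainty_ranking)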
In [34]:
import plotly.express as px

# Create the Violin Plot
fig = px.violin(
    all_residuals_df, 
    y="Residual", 
    x="Disease Type", 
    box=True,       # Draws a box plot inside the violin
    points="all",   # Shows the actual dots
    hover_data=['Year'] # Lets you see which year had the big error
)

fig.update_layout(
    height=1200,  # Sets the height to 1200 pixels (standard is ~450)
    width=1000,  # Optional: Makes it wider too if the text is squashed
    title="Uncertainty Comparison: Residuals Across All Diseases",
    yaxis_title="Residual (Error in Number of Deaths)",
    template="plotly_white"
)

fig.show()
[Plot: "Uncertainty Comparison: Residuals Across All Diseases" -- violin plots of residuals per disease type]
In [35]:
fig = px.box(
    all_residuals_df, 
    x="Disease Type", 
    y="Residual",
    points="all", # Show the actual dots
    hover_data=['Year']
)

# Add a red line at 0 (The Target)
fig.add_hline(y=0, line_dash="dash", line_color="red", annotation_text="Perfect Prediction")

fig.update_layout(
    title="Uncertainty Analysis: Box Plot of Errors",
    yaxis_title="Error (Residual)",
    height=800, # Taller chart
    template="plotly_white"
)

fig.show()
[Plot: "Uncertainty Analysis: Box Plot of Errors" -- box plots of residuals per disease type, with a red dashed zero line]