Desel P. Dorji - Fab Futures - Data Science

Transforms¶

Analysing my data set:

I have taken Neil's comment very literally and to my advantage - for this assignment, I've stuck to understanding and using Principal Components Analysis (PCA).

Understanding PCA¶

After asking Gemini "What is Principal component analysis (PCA)?", I further referred to https://scikit-learn.org/stable/modules/decomposition.html#pca and asked Gemini to break down and simplify some of the language they used.

The following are some of my takeaways, copied from Gemini and rephrased for my own understanding:

Imagine you are trying to take a photo of a teapot.

  • The Problem: The teapot is a 3D object (complex), but your photo is 2D (simple).
  • The Goal: You want to find the best angle to take the photo so you can clearly see the handle, the spout, and the lid all at once.
  • Bad Angle: If you take a photo from directly above, it just looks like a circle. You lost the "teapot-ness."

PCA is the mathematical photographer that automatically rotates the object to find the angle that captures the most information (variance) in a flat picture.

PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns components in its fit method, and can be used on new data to project it on these components. (scikit-learn.org)
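To make the "transformer object" wording concrete, here is a minimal sketch of the fit-then-project pattern the quote describes. The arrays are random stand-in data, not the mortality dataset, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (NOT the mortality dataset): rows = samples, 5 columns = 5 variables
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 5))
X_new = rng.normal(size=(2, 5))

pca = PCA(n_components=2)
pca.fit(X_train)                        # learns the components from X_train only
train_scores = pca.transform(X_train)   # X_train expressed along the 2 components
new_scores = pca.transform(X_new)       # unseen rows projected onto the same components

print(pca.components_.shape)   # one row per component, one column per original variable
print(new_scores.shape)        # one row per new sample, one column per component
```

`fit_transform`, used later in this notebook, does the `fit` and the `transform` on the same data in one call.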

A. "Decompose Multivariate Dataset"
What it means: You have a dataset with many variables (in your case, 5 years: 2018–2022).
The Goal: To break this "complex 5-dimensional block of data" down into simpler, separate pieces (components).
This part tripped me up a bit, but I think it means 5-dimensional in the sense that each of the 5 years is its own axis, so every disease is a single point that would need 5 axes to plot, which can't be drawn directly.

B. "Orthogonal Components"
What it means: "Orthogonal" is the fancy math word for perpendicular (at a 90-degree angle).
Why it matters: When PCA finds the first trend (Component 1), the second trend (Component 2) is mathematically forced to be uncorrelated with the first one. If Component 1 represents "Overall rising deaths," Component 2 cannot be about rising deaths. It must be about something else (e.g., "Spiking in 2020 but falling in 2022"). This prevents redundancy. Each component gives you new information.

C. "Maximum Variance"
What it means: Variance = Information.
The Process: PCA draws a line through your data. It rotates that line until the "spread" of the data points along that line is as wide as possible.

  • The first component would capture the "Main Story" (the biggest spread) -- this would also be the "Best Angle," a new line drawn through the data that captures the maximum amount of variation.

  • The second component would capture the "Secondary Story" (the next biggest spread) -- this would be the second-best angle, perpendicular to the first one.
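The "perpendicular" and "biggest spread first" ideas above can be checked numerically. This is a sketch under stated assumptions: it uses a synthetic 2D cloud rather than the mortality data, and computes the components with a plain SVD of the centred data instead of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2D cloud stretched along one direction (illustrative data only)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

Xc = X - X.mean(axis=0)                   # PCA works on centred data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pc1, pc2 = Vt[0], Vt[1]                   # the two component directions
scores = Xc @ Vt.T                        # the data expressed along PC1 and PC2
var_along = scores.var(axis=0, ddof=1)    # spread of the data along each component

print("dot(pc1, pc2):", float(pc1 @ pc2))          # perpendicular vectors have dot product ~0
print("variance along PC1, PC2:", var_along)       # PC1 carries the larger spread
print("total variance before/after:", Xc.var(axis=0, ddof=1).sum(), var_along.sum())
```

The variance here is the spread of the data points themselves along each new axis; no model or expected value is involved.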

One misunderstanding I had was about the word "variance": I initially thought it referred to the variation between expected and actual data (the residuals), but here it refers to the spread of the data itself.


What I liked about reading up on PCA is that it seems to answer a lingering question I had from the previous assignment -- Gemini, in its response, highlighted the following point:

Right now, you are looking at diseases one by one. But what if you wanted to know: "Which diseases behave exactly the same way?" You could use PCA to group your diseases based on their trends over the 5 years.

This was exactly the question I had highlighted - so far, I've only been looking at my data in terms of the category_name variable, i.e., types of diseases with five data points each.

My next question to Gemini was, "I want to use PCA so that I can compare three components - is this possible?"

This was their given suggestion:

PC1 (X-axis): The Main Trend. (Usually "Did deaths go up or down overall?").
PC2 (Y-axis): The Second Pattern. (e.g., "Did it spike in 2020 but stay flat otherwise?").
PC3 (Z-axis): The Third Nuance. (e.g., "Did it fluctuate wildly or stay steady?").

I actually couldn't fully understand the three components they'd highlighted, as it doesn't seem like it compares across diseases.

My next prompt was that I want my three components to be:

  1. Diseases that are falling (e.g., mortality rates of X diseases are decreasing).
  2. Diseases that are rising (e.g., mortality rates of X diseases are increasing).
  3. Diseases that are unpredictable.

The response was:

  • You cannot "force" PCA to output exactly those three categories.
  • PCA is a blind mathematical machine. It defines "Components" based on Variance (Pattern Strength), not Semantics (Meaning). It doesn't know what "Falling" or "Unpredictable" means.

So how, then, does one interpret a PCA graph? (Prompt: If we can't predetermine the components, how does one interpret a PCA graph?)

  1. Look at Neighbors (Who sits together?)
     This is the most important step. Even if you don't know what the axes mean, proximity = similarity.
     The Cluster Rule: If "Infectious Diseases" and "Respiratory Diseases" are right next to each other on the graph, they have very similar histories, at least along the patterns the plotted components capture.
     The Outlier Rule: If "Covid-19" is floating all alone in the top-right corner, far away from the pack, it means its behavior was unique. No other disease followed that pattern.

  2. Look at the Center vs. The Edge
     The Center (0,0): This is the Average. Diseases here are "boring." They followed the standard national trend.
     The Edge: The further a point is from the center, the more extreme its behavior was regarding that specific component.
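The two reading rules above (neighbours are similar, the origin is the average) can be sketched with a few invented rows. The categories "A" to "D" and their counts are hypothetical, and the components are computed with a plain SVD rather than scikit-learn.

```python
import numpy as np

# Hypothetical yearly counts for four made-up categories (not the real dataset)
names = ["A", "B", "C", "D"]
X = np.array([
    [10, 12, 14, 16, 18],   # A: steady rise
    [11, 13, 15, 17, 19],   # B: almost identical to A
    [18, 16, 14, 12, 10],   # C: steady fall
    [10, 10, 40, 10, 10],   # D: one-year spike
], dtype=float)

Xc = X - X.mean(axis=0)                     # centring puts the "average" row at the origin
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                      # PC1/PC2 coordinates for each category

# Pairwise distances between categories in the PC1-PC2 plane
dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)

print("mean of scores:", scores.mean(axis=0))   # approximately (0, 0)
print("distance A-B:", dist[0, 1])              # similar histories sit close together
print("distance A-D:", dist[0, 3])              # the spiky outlier sits far away
```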

Finally, my question was how it arrived at the three components suggested above:

  1. Why PC1 is usually "Magnitude" or "Main Trend"
  • PCA looks for the biggest source of difference between your data points.
  • The Reality of Disease Data: The biggest difference between "Heart Disease" and "Rabies" isn't the trend; it's the Size.
  • Heart Disease = ~1,000 deaths/year.
  • Rabies = ~0 deaths/year.
  • The Math: When PCA sees this massive gap (0 vs 1,000), it dedicates the first component (PC1) to explaining it.
  • Result: Therefore, PC1 usually becomes a measure of "How big is this disease?"
  2. Why PC2 is usually "The Slope" (Rising vs. Falling)
  • Once PC1 explains the size difference, PCA looks for the next biggest difference.
  • The Reality: After size, the most obvious difference is direction. Some lines go up, some go down.
  • The Math: PCA finds a pattern where early years (2018) are negative and late years (2022) are positive (or vice versa).
  • Result: Therefore, PC2 usually captures "Is it getting better or worse?"
  3. Why PC3 is usually "The Curvature" (Spikes/Fluctuation)
  • After explaining Size (PC1) and Direction (PC2), what is left?
  • The Reality: The "wiggles." Maybe a disease was flat, then spiked in 2020 (Covid), then went flat again.
  • The Math: PCA looks for a U-shape or an inverted U-shape.
  • Result: Therefore, PC3 usually captures "Did something weird happen in the middle?"
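One way to test the "PC1 is mostly size" claim is to look at the component loadings directly. The sketch below uses invented counts of very different magnitudes (not the real dataset) and a small hypothetical helper, `pca_svd`. The row-normalisation at the end is an extra step assumed for illustration, not something the notebook does.

```python
import numpy as np

years = ["2018", "2019", "2020", "2021", "2022"]
# Hypothetical counts (not the real dataset): very different sizes, mild trends
X = np.array([
    [1000, 1010, 1020, 1030, 1040],   # large, rising
    [ 500,  495,  490,  485,  480],   # medium, falling
    [  50,   52,   90,   51,   50],   # small, spike in the middle
    [   2,    1,    0,    1,    2],   # tiny
], dtype=float)

def pca_svd(data):
    """Return (scores, loadings, explained variance ratio) via SVD of the centred data."""
    centred = data - data.mean(axis=0)
    _, S, Vt = np.linalg.svd(centred, full_matrices=False)
    ratio = S**2 / np.sum(S**2)
    return centred @ Vt.T, Vt, ratio

scores, loadings, ratio = pca_svd(X)
print("PC1 loadings per year:", np.round(loadings[0], 2))   # every year pulls the same way
print("share of variance in PC1:", float(ratio[0]))

# PC1 scores line up with the overall size (the mean count per row)
corr = np.corrcoef(scores[:, 0], X.mean(axis=1))[0, 1]
print("|corr(PC1 score, mean count)|:", abs(float(corr)))

# Dividing each row by its own mean removes size, so PCA compares shapes instead
X_shape = X / X.mean(axis=1, keepdims=True)
_, _, ratio_shape = pca_svd(X_shape)
print("share of variance in PC1 after row-normalising:", float(ratio_shape[0]))
```

If PC1 should describe trends rather than totals, rescaling each row before PCA (as in the last step) is one option; which scaling is appropriate depends on the question being asked.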

The more I read about PCA, the more I feel I don't know. However, what little I understood I tried to apply below.

Data analysis using PCA¶

In [1]:
from sklearn.decomposition import PCA
import pandas as pd
import plotly.express as px
In [2]:
df = pd.read_csv("datasets/Mortality cases3.csv")

years = ['2018', '2019', '2020', '2021', '2022']
df['Type of disease'] = df['Type of disease'].str.strip()
df['Type of disease'] = df['Type of disease'].replace('Infectious', 'Infectious Diseases')
df_grouped = df.groupby('Type of disease')[years].sum()
In [3]:
# 1. RE-SHAPING THE DATA
# For PCA to work, the data needs to be arranged so that rows = diseases and columns = years
df_pivot = df_grouped.copy()

# 2. RUNNING PCA
# We want to squash the 5 years of history into 2 "Summary Dimensions"
pca = PCA(n_components=2)
components = pca.fit_transform(df_pivot)

# 3. CREATING A DATAFRAME FOR PLOTTING
pca_df = pd.DataFrame(data=components, columns=['PC1', 'PC2'])
pca_df['Disease Name'] = df_pivot.index

# 4. PLOT
fig = px.scatter(
    pca_df, x='PC1', y='PC2', 
    text='Disease Name',
    title="PCA: Which diseases behave similarly?",
    template="plotly_white"
)
fig.update_traces(textposition='top center')
fig.show()
[Figure: scatter plot "PCA: Which diseases behave similarly?", PC1 on the x-axis and PC2 on the y-axis, one labelled point per disease type]
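Before reading meaning into a plot like this, it can help to check how much of the variance the two components keep and which years drive each one. The sketch below uses a small invented table, `demo`, as a stand-in for `df_grouped` (the real CSV is not reproduced here); `explained_variance_ratio_` and `components_` are attributes of a fitted scikit-learn `PCA`.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for df_grouped (rows = disease types, columns = years);
# the numbers are invented for illustration only.
years = ['2018', '2019', '2020', '2021', '2022']
demo = pd.DataFrame(
    [[120, 130, 150, 140, 135],
     [ 15,  14,  30,  12,  11],
     [ 60,  55,  50,  45,  40],
     [  5,   6,   5,   7,   6]],
    index=['Type A', 'Type B', 'Type C', 'Type D'],
    columns=years,
)

pca = PCA(n_components=2)
scores = pca.fit_transform(demo)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Loadings: how strongly each year contributes to each component
loadings = pd.DataFrame(pca.components_, index=['PC1', 'PC2'], columns=years)
print(loadings.round(2))
```

If the PC1 loadings all have the same sign and similar size, PC1 is mostly measuring how large each disease type is overall rather than how it changed over time.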

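What does this show? I asked Gemini to help me interpret the plot. Note that the response below talks about trend, error, bubble size and red/green colours, none of which appear in the PCA scatter above (its axes are simply PC1 and PC2); it matches the slope-versus-error chart in the "Extra: Non-PCA analysis" section instead.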

This chart is a Disease Classification Map. It tells you exactly which diseases are growing, which are fading, and which are acting crazy.

  1. The "Success Stories" (Bottom Left)
  • Where: Left side (Negative Trend), Bottom (Low Error).
  • What it means: These diseases are Falling steadily.
  • Story: Whatever the doctors and government are doing is working perfectly. The death count is dropping year after year with very little surprise.
  2. The "Steady Threats" (Bottom Right)
  • Where: Right side (Positive Trend), Bottom (Low Error).
  • What it means: These diseases are Rising steadily.
  • Story: This is a slow-moving crisis. It isn't an explosion, but a consistent "creep" upwards (like aging-related diseases or lifestyle diseases).
  3. The "Wild Cards" (Top Half)
  • Where: High up on the Y-axis (High Unpredictability).
  • What it means: These diseases are Chaotic.
  • Story: It doesn't matter if they are red or green; the high position means your straight line model is doing a bad job. These diseases likely have massive outbreaks one year and disappear the next.
  4. The "Big Killers" (Big Bubbles)
  • Where: Large circles anywhere.
  • What it means: These diseases kill the most people (Magnitude).
  • Story: Even if a disease is "Falling" (Green), if the bubble is huge, it is still a major priority. A 1% drop in a huge killer saves more lives than a 50% drop in a tiny killer.

Summary for your Project

Best Scenario: A huge green bubble in the bottom left (A major killer is being defeated).

Worst Scenario: A huge red bubble in the bottom right (A major killer is getting worse, and we can't stop it).

Scariest Scenario: A huge bubble at the very top (A major killer that we can't predict).

In [12]:
# out of curiosity, I tried the same code again but with 3 components - ideally, this would create a 3D plot
from sklearn.decomposition import PCA
import pandas as pd
import plotly.express as px

# 1. RE-SHAPING THE DATA
# For PCA to work, the data needs to be arranged so that rows = diseases and columns = years
df_pivot = df_grouped.copy()

# 2. RUNNING PCA
pca = PCA(n_components=3)
components = pca.fit_transform(df_pivot)

# 3. CREATING A DATAFRAME FOR PLOTTING
pca_df = pd.DataFrame(data=components, columns=['PC1', 'PC2', 'PC3'])
pca_df['Disease Name'] = df_pivot.index

# 4. PLOT
# Realised a scatter plot wouldn't have worked for me -- had to look up what syntax (?) to use for a 3D plot
fig = px.scatter_3d(
    pca_df, x='PC1', y='PC2', z='PC3',
    text='Disease Name',
    title="PCA: Which diseases behave similarly?",
    template="plotly_white"
)
fig.update_traces(textposition='top center')

# Took this bottom part from previous code that needed similar formatting
fig.update_layout(
    height=800, 
    template="plotly_white",
    scene=dict(
        xaxis_title='PC1 (Main Trend)',
        yaxis_title='PC2 (Secondary Pattern)',
        zaxis_title='PC3 (Nuance)')
)
        
fig.show()
[Figure: 3D scatter plot "PCA: Which diseases behave similarly?", with axes PC1 (Main Trend), PC2 (Secondary Pattern) and PC3 (Nuance)]

Extra: Non-PCA analysis¶

In response to what I wanted (the three components I had initially asked for), Gemini gave me the following code that I will keep here as reference.

In [6]:
import plotly.express as px
from scipy import stats
import pandas as pd
import numpy as np

# 1. CALCULATE YOUR CUSTOM COMPONENTS
custom_data = []

for disease in df_grouped.index:
    y = df_grouped.loc[disease].values
    x = np.arange(len(y))
    
    # Fit a straight line: slope = trend; std_err = standard error of the slope,
    # used here as a rough proxy for unpredictability
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    
    # Classify the trend for color-coding
    if slope > 1:
        trend_label = "Rising"
    elif slope < -1:
        trend_label = "Falling"
    else:
        trend_label = "Flat/Stable"
        
    custom_data.append({
        'Disease Name': disease,
        'Trend Score (Slope)': slope,        # Component 1 & 2
        'Unpredictability (Error)': std_err, # Component 3
        'Category': trend_label,
        'Average Deaths': np.mean(y)         # Size of bubble
    })

df_custom = pd.DataFrame(custom_data)

# 2. PLOT THE RESULT
# X-Axis = Trend (Left is Falling, Right is Rising)
# Y-Axis = Unpredictability (Top is Chaos, Bottom is Stable)
fig = px.scatter(
    df_custom,
    x="Trend Score (Slope)",
    y="Unpredictability (Error)",
    color="Category",           # Color by Rising/Falling
    size="Average Deaths",      # Size by Magnitude
    text="Disease Name",
    title="Disease Classification: Rising, Falling, vs. Unpredictable",
    color_discrete_map={"Rising": "red", "Falling": "green", "Flat/Stable": "gray"}
)

# Add crosshairs to divide the quadrants
fig.add_vline(x=0, line_dash="dash", line_color="gray") # Zero Trend Line
fig.add_hline(y=df_custom['Unpredictability (Error)'].mean(), line_dash="dash", line_color="gray") # Average Noise Line

fig.update_traces(textposition='top center')
fig.update_layout(template="plotly_white", height=600)
fig.show()
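To see what the two axes of this chart measure, here is a NumPy-only sketch of the slope and its standard error for two invented series. The helper `slope_and_stderr` is hypothetical; as far as I can tell it computes the same slope standard error that `stats.linregress` reports as `std_err`, but treat that equivalence as an assumption.

```python
import numpy as np

def slope_and_stderr(y):
    """Least-squares slope of y against 0..n-1 and the standard error of that slope."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    sxx = np.sum((x - x.mean()) ** 2)
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters)
    stderr = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2) / sxx)
    return slope, stderr

steady = [10, 12, 14, 16, 18]   # hypothetical: perfectly linear rise
spiky = [10, 10, 40, 10, 10]    # hypothetical: flat with a one-year spike

for name, series in [("steady", steady), ("spiky", spiky)]:
    slope, stderr = slope_and_stderr(series)
    print(name, "slope:", round(float(slope), 3), "stderr:", round(float(stderr), 3))
```

A series that a straight line fits well gets a small standard error, while a spike that the line cannot follow pushes it up, which is why the chart above uses it as the "unpredictability" axis.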