Adrian Torres - Fab Lab León - Fab Futures - Data Science
Home About

< Home

Week 4: Transforms¶

December 9, 2025

Let's move on to the last assignment; I liked that Neil gave a quick overview of all the weeks we've covered in this Fab Futures. That said, this assignment might have been more useful at the beginning for better filtering our datasets. ;)

Principal Components Analysis (PCA)¶

I asked my friend ChatGPT to explain a little better what it is Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a mathematical technique that finds a linear transformation of the data that:

✅ 1. Minimizes correlation between the transformed variables

The principal components are orthogonal, meaning they are uncorrelated. This removes redundancy.

✅ 2. Maximizes explained variance

The first principal component captures the largest variance in the data.

The second captures the next largest amount, and so on.

The more variance a component explains, the more “information” it contains.

🎯 What is PCA used for?

Dimensionality reduction → reducing many variables into just a few.

Feature extraction → identifying meaningful directions in the data.

Low-dimensional visualization of high-dimensional datasets → projecting data into 2D or 3D for interpretation.

Preprocessing and denoising → removing noise and correlations, improving ML performance.

Complete example of PCA applied to my Barcelona data, filtering the time slots between 00:30 and 04:30 because there are no travelers.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# =========================================================
# 1. Load data and prepare MINUTES
# =========================================================
df = pd.read_csv(
    "datasets/barcelona_viajeros_por_franja_csv.csv",
    encoding="latin-1",
    sep=";"
)

# Barcelona core only
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()

# Convert TIME_SECTION to minutes from midnight (we use the start of the segment)
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()
    h, m = map(int, inicio.split(":"))
    return h * 60 + m

df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)

# =========================================================
# 2. Remove time slots between 00:30 and 04:30
# 00:30 -> 30 min, 04:30 -> 270 min
# =========================================================
df = df[(df["MINUTOS"] < 30) | (df["MINUTOS"] >= 270)].copy()

# =========================================================
# 3. Add by time slot (MINUTES) across the entire core
# =========================================================
df_agg = df.groupby("MINUTOS")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum()
df_agg = df_agg.sort_index().reset_index()

print("Primeras filas después del filtrado y la agregación:")
print(df_agg.head())

# =========================================================
# 4. Build a data matrix for PCA
# We use two variables: UPLOADED and DOWNLOADED
# =========================================================
X = df_agg[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].values.astype(float)

# Standardization (very important for PCA)
X_mean = X.mean(axis=0)
X_std  = X.std(axis=0) + 1e-8
X_stdzd = (X - X_mean) / X_std

# =========================================================
# 5. PCA "by hand": covariance + eigenvectors
# =========================================================
# Covariance matrix of the standardized data
cov_mat = np.cov(X_stdzd, rowvar=False)

# Eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)  #eigh because it is symmetrical

# Sort from largest to smallest eigenvalue
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]

# Variance explained
explained_variance = eig_vals
explained_variance_ratio = eig_vals / eig_vals.sum()

print("\nAutovalores (varianza por componente):", explained_variance)
print("Varianza explicada (%):", explained_variance_ratio * 100)

# Projection of the data in the principal component space
X_pca = X_stdzd @ eig_vecs  # each column = PC1, PC2, ...

# =========================================================
# 6. Explained variance graph
# =========================================================
plt.figure(figsize=(5,4))
components = np.arange(1, len(eig_vals)+1)
plt.bar(components, explained_variance_ratio * 100)
plt.xticks(components, [f"PC{i}" for i in components])
plt.ylabel("Variance explained (%)")
plt.title("PCA – Component-explained variance")
plt.tight_layout()
plt.show()

# =========================================================
# 7. Visualizing Data in PCA Space
# ==========================================================
# PC1 vs PC2, Colored by Time of Day
minutos = df_agg["MINUTOS"].values

plt.figure(figsize=(8,6))
scatter = plt.scatter(
    X_pca[:, 0],  # PC1
    X_pca[:, 1],  # PC2
    c=minutos,
    cmap="viridis"
)
plt.colorbar(scatter, label="Minutes since midnight")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA on passengers (boarding / alighting)\n(segments 00:30–04:30 removed)")
plt.tight_layout()
plt.show()

# =========================================================
# 8. Compare original space vs first component
# =========================================================
plt.figure(figsize=(8,6))
plt.scatter(
    X_stdzd[:, 0],
    X_stdzd[:, 1],
    alpha=0.6,
    label="Standardized data"
)

# We draw the direction of the first main component
origin = np.array([[0, 0]])
pc1_dir = eig_vecs[:, 0] * 2  # Scale for better viewing
plt.quiver(
    origin[0,0], origin[0,1],
    pc1_dir[0], pc1_dir[1],
    angles='xy', scale_units='xy', scale=1,
    color='r',
    label="PC1"
)

plt.axhline(0, color="grey", linewidth=0.5)
plt.axvline(0, color="grey", linewidth=0.5)
plt.xlabel("PASSENGERS_ON (std)")
plt.ylabel("PASSENGERS_OFF (std)")
plt.title("Standardized original space + PC1 address")
plt.legend()
plt.tight_layout()
plt.show()
Primeras filas después del filtrado y la agregación:
   MINUTOS  VIAJEROS_SUBIDOS  VIAJEROS_BAJADOS
0        0               377               916
1      270               424                24
2      300              1120               584
3      330              2473              1049
4      360              6070              2499

Autovalores (varianza por componente): [1.94628479 0.10499726]
Varianza explicada (%): [94.88138351  5.11861649]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

PCA 3D → 2D Projection (Shape of the Day)¶

🧠 How to interpret this 3D → 2D PCA

✅ PC1

It will almost certainly be the combination that best describes:

The overall evolution of passenger volume throughout the day

That is: ✔ Low demand during the night ✔ Morning peak ✔ Midday valley ✔ Afternoon peak

👉 This is literally the “shape of the day.”

✅ PC2

It usually captures:

The difference between boarding and alighting patterns

For example:

“Origin” stations in the morning

“Destination” stations in the afternoon

✅ PC3

It usually represents:

Residual noise

Small local asymmetries

Variations that are not globally significant

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# =========================================================
# 1. Load data
# =========================================================
df = pd.read_csv(
    "datasets/barcelona_viajeros_por_franja_csv.csv",
    encoding="latin-1",
    sep=";"
)

# Barcelona core only
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()

# =========================================================
# 2. Convert TIME_SECTION to minutes from midnight
# =========================================================
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()
    h, m = map(int, inicio.split(":"))
    return h * 60 + m

df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)

# =========================================================
# 3. Eliminate non-service periods (00:30–04:30)
# 00:30 = 30 min | 04:30 = 270 min
# =========================================================
df = df[(df["MINUTOS"] < 30) | (df["MINUTOS"] >= 270)].copy()

# =========================================================
# 4. Add by time slot (entire core)
# =========================================================
df_agg = df.groupby("MINUTOS")[
    ["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]
].sum().reset_index().sort_values("MINUTOS")

# =========================================================
# 5. Build a 3D matrix for PCA
# X = [MINUTES, UP, DOWN]
# =========================================================
X = df_agg[["MINUTOS", "VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].values.astype(float)

# Standardization (key to PCA)
X_mean = X.mean(axis=0)
X_std  = X.std(axis=0) + 1e-8
X_stdzd = (X - X_mean) / X_std

# =========================================================
# 6. 3D PCA "by hand" (covariance + eigenvectors)
# =========================================================
cov_mat = np.cov(X_stdzd, rowvar=False)

eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Sort by variance in descending order
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]

explained_variance_ratio = eig_vals / eig_vals.sum()

print("Varianza explicada por componente:")
print("PC1:", explained_variance_ratio[0])
print("PC2:", explained_variance_ratio[1])
print("PC3:", explained_variance_ratio[2])

# =========================================================
# 7. 3D Projection → PCA Space
# =========================================================
X_pca = X_stdzd @ eig_vecs   # (N, 3)

PC1 = X_pca[:, 0]
PC2 = X_pca[:, 1]
PC3 = X_pca[:, 2]

# =========================================================
# 8. Explained variance graph
# =========================================================
plt.figure(figsize=(5,4))
plt.bar(["PC1", "PC2", "PC3"], explained_variance_ratio * 100)
plt.ylabel("Variance explained (%)")
plt.title("3D PCA – Explained Variance")
plt.tight_layout()
plt.show()

# =========================================================
# 9. Final 2D Projection: PC1 vs PC2 (SHOW OF THE DAY)
# =========================================================
minutos = df_agg["MINUTOS"].values

plt.figure(figsize=(9,6))
scatter = plt.scatter(
    PC1,
    PC2,
    c=minutos,
    cmap="viridis"
)

plt.colorbar(scatter, label="Minutes since midnight")
plt.xlabel("PC1 (maximum variance)")
plt.ylabel("PC2 (second maximum variance)")
plt.title("PCA 3D → 2D Projection (PC1 and PC2)\nShape of the railway day in Barcelona")
plt.tight_layout()
plt.show()

# =========================================================
# 10. Geometric Interpretation in the Original Space
# Directions of PC1 and PC2 on the Original Variables
# =========================================================
plt.figure(figsize=(8,6))

plt.scatter(
    X_stdzd[:, 0],  # MINUTES
    X_stdzd[:, 1],  # UPLOADED
    alpha=0.5,
    label="Data (MINUTES vs UPLOADED)"
)

origin = np.array([[0, 0]])

pc1_dir = eig_vecs[[0,1], 0] * 3
pc2_dir = eig_vecs[[0,1], 1] * 3

plt.quiver(0, 0, pc1_dir[0], pc1_dir[1],
           angles="xy", scale_units="xy", scale=1,
           label="PC1")

plt.quiver(0, 0, pc2_dir[0], pc2_dir[1],
           angles="xy", scale_units="xy", scale=1,
           label="PC2")

plt.axhline(0)
plt.axvline(0)

plt.xlabel("MINUTES (std)")
plt.ylabel("PASSENGERS_ON (std)")
plt.title("Addresses of PC1 and PC2 in the original space")
plt.legend()
plt.tight_layout()
plt.show()
Varianza explicada por componente:
PC1: 0.632673146927382
PC2: 0.33567944726256244
PC3: 0.031647405810055526
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image