< Home
Week 4: Transforms¶
December 9, 2025
Let's move on to the last assignment; I liked that Neil gave a quick overview of all the weeks we've covered in this Fab Futures. That said, this assignment might have been more useful at the beginning for better filtering our datasets. ;)
Principal Components Analysis (PCA)¶
I asked my friend ChatGPT to explain a little better what it is Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a mathematical technique that finds a linear transformation of the data that:
✅ 1. Minimizes correlation between the transformed variables
The principal components are orthogonal, meaning they are uncorrelated. This removes redundancy.
✅ 2. Maximizes explained variance
The first principal component captures the largest variance in the data.
The second captures the next largest amount, and so on.
The more variance a component explains, the more “information” it contains.
🎯 What is PCA used for?
Dimensionality reduction → reducing many variables into just a few.
Feature extraction → identifying meaningful directions in the data.
Low-dimensional visualization of high-dimensional datasets → projecting data into 2D or 3D for interpretation.
Preprocessing and denoising → removing noise and correlations, improving ML performance.
Complete example of PCA applied to my Barcelona data, filtering the time slots between 00:30 and 04:30 because there are no travelers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# =========================================================
# 1. Load data and prepare MINUTES
# =========================================================
df = pd.read_csv(
"datasets/barcelona_viajeros_por_franja_csv.csv",
encoding="latin-1",
sep=";"
)
# Barcelona core only
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()
# Convert TIME_SECTION to minutes from midnight (we use the start of the segment)
def time_to_min(tramo):
inicio = tramo.split("-")[0].strip()
h, m = map(int, inicio.split(":"))
return h * 60 + m
df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)
# =========================================================
# 2. Remove time slots between 00:30 and 04:30
# 00:30 -> 30 min, 04:30 -> 270 min
# =========================================================
df = df[(df["MINUTOS"] < 30) | (df["MINUTOS"] >= 270)].copy()
# =========================================================
# 3. Add by time slot (MINUTES) across the entire core
# =========================================================
df_agg = df.groupby("MINUTOS")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum()
df_agg = df_agg.sort_index().reset_index()
print("Primeras filas después del filtrado y la agregación:")
print(df_agg.head())
# =========================================================
# 4. Build a data matrix for PCA
# We use two variables: UPLOADED and DOWNLOADED
# =========================================================
X = df_agg[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].values.astype(float)
# Standardization (very important for PCA)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0) + 1e-8
X_stdzd = (X - X_mean) / X_std
# =========================================================
# 5. PCA "by hand": covariance + eigenvectors
# =========================================================
# Covariance matrix of the standardized data
cov_mat = np.cov(X_stdzd, rowvar=False)
# Eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eigh(cov_mat) #eigh because it is symmetrical
# Sort from largest to smallest eigenvalue
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]
# Variance explained
explained_variance = eig_vals
explained_variance_ratio = eig_vals / eig_vals.sum()
print("\nAutovalores (varianza por componente):", explained_variance)
print("Varianza explicada (%):", explained_variance_ratio * 100)
# Projection of the data in the principal component space
X_pca = X_stdzd @ eig_vecs # each column = PC1, PC2, ...
# =========================================================
# 6. Explained variance graph
# =========================================================
plt.figure(figsize=(5,4))
components = np.arange(1, len(eig_vals)+1)
plt.bar(components, explained_variance_ratio * 100)
plt.xticks(components, [f"PC{i}" for i in components])
plt.ylabel("Variance explained (%)")
plt.title("PCA – Component-explained variance")
plt.tight_layout()
plt.show()
# =========================================================
# 7. Visualizing Data in PCA Space
# ==========================================================
# PC1 vs PC2, Colored by Time of Day
minutos = df_agg["MINUTOS"].values
plt.figure(figsize=(8,6))
scatter = plt.scatter(
X_pca[:, 0], # PC1
X_pca[:, 1], # PC2
c=minutos,
cmap="viridis"
)
plt.colorbar(scatter, label="Minutes since midnight")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA on passengers (boarding / alighting)\n(segments 00:30–04:30 removed)")
plt.tight_layout()
plt.show()
# =========================================================
# 8. Compare original space vs first component
# =========================================================
plt.figure(figsize=(8,6))
plt.scatter(
X_stdzd[:, 0],
X_stdzd[:, 1],
alpha=0.6,
label="Standardized data"
)
# We draw the direction of the first main component
origin = np.array([[0, 0]])
pc1_dir = eig_vecs[:, 0] * 2 # Scale for better viewing
plt.quiver(
origin[0,0], origin[0,1],
pc1_dir[0], pc1_dir[1],
angles='xy', scale_units='xy', scale=1,
color='r',
label="PC1"
)
plt.axhline(0, color="grey", linewidth=0.5)
plt.axvline(0, color="grey", linewidth=0.5)
plt.xlabel("PASSENGERS_ON (std)")
plt.ylabel("PASSENGERS_OFF (std)")
plt.title("Standardized original space + PC1 address")
plt.legend()
plt.tight_layout()
plt.show()
Primeras filas después del filtrado y la agregación: MINUTOS VIAJEROS_SUBIDOS VIAJEROS_BAJADOS 0 0 377 916 1 270 424 24 2 300 1120 584 3 330 2473 1049 4 360 6070 2499 Autovalores (varianza por componente): [1.94628479 0.10499726] Varianza explicada (%): [94.88138351 5.11861649]
PCA 3D → 2D Projection (Shape of the Day)¶
🧠 How to interpret this 3D → 2D PCA
✅ PC1
It will almost certainly be the combination that best describes:
The overall evolution of passenger volume throughout the day
That is: ✔ Low demand during the night ✔ Morning peak ✔ Midday valley ✔ Afternoon peak
👉 This is literally the “shape of the day.”
✅ PC2
It usually captures:
The difference between boarding and alighting patterns
For example:
“Origin” stations in the morning
“Destination” stations in the afternoon
✅ PC3
It usually represents:
Residual noise
Small local asymmetries
Variations that are not globally significant
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# =========================================================
# 1. Load data
# =========================================================
df = pd.read_csv(
"datasets/barcelona_viajeros_por_franja_csv.csv",
encoding="latin-1",
sep=";"
)
# Barcelona core only
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()
# =========================================================
# 2. Convert TIME_SECTION to minutes from midnight
# =========================================================
def time_to_min(tramo):
inicio = tramo.split("-")[0].strip()
h, m = map(int, inicio.split(":"))
return h * 60 + m
df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)
# =========================================================
# 3. Eliminate non-service periods (00:30–04:30)
# 00:30 = 30 min | 04:30 = 270 min
# =========================================================
df = df[(df["MINUTOS"] < 30) | (df["MINUTOS"] >= 270)].copy()
# =========================================================
# 4. Add by time slot (entire core)
# =========================================================
df_agg = df.groupby("MINUTOS")[
["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]
].sum().reset_index().sort_values("MINUTOS")
# =========================================================
# 5. Build a 3D matrix for PCA
# X = [MINUTES, UP, DOWN]
# =========================================================
X = df_agg[["MINUTOS", "VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].values.astype(float)
# Standardization (key to PCA)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0) + 1e-8
X_stdzd = (X - X_mean) / X_std
# =========================================================
# 6. 3D PCA "by hand" (covariance + eigenvectors)
# =========================================================
cov_mat = np.cov(X_stdzd, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
# Sort by variance in descending order
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]
explained_variance_ratio = eig_vals / eig_vals.sum()
print("Varianza explicada por componente:")
print("PC1:", explained_variance_ratio[0])
print("PC2:", explained_variance_ratio[1])
print("PC3:", explained_variance_ratio[2])
# =========================================================
# 7. 3D Projection → PCA Space
# =========================================================
X_pca = X_stdzd @ eig_vecs # (N, 3)
PC1 = X_pca[:, 0]
PC2 = X_pca[:, 1]
PC3 = X_pca[:, 2]
# =========================================================
# 8. Explained variance graph
# =========================================================
plt.figure(figsize=(5,4))
plt.bar(["PC1", "PC2", "PC3"], explained_variance_ratio * 100)
plt.ylabel("Variance explained (%)")
plt.title("3D PCA – Explained Variance")
plt.tight_layout()
plt.show()
# =========================================================
# 9. Final 2D Projection: PC1 vs PC2 (SHOW OF THE DAY)
# =========================================================
minutos = df_agg["MINUTOS"].values
plt.figure(figsize=(9,6))
scatter = plt.scatter(
PC1,
PC2,
c=minutos,
cmap="viridis"
)
plt.colorbar(scatter, label="Minutes since midnight")
plt.xlabel("PC1 (maximum variance)")
plt.ylabel("PC2 (second maximum variance)")
plt.title("PCA 3D → 2D Projection (PC1 and PC2)\nShape of the railway day in Barcelona")
plt.tight_layout()
plt.show()
# =========================================================
# 10. Geometric Interpretation in the Original Space
# Directions of PC1 and PC2 on the Original Variables
# =========================================================
plt.figure(figsize=(8,6))
plt.scatter(
X_stdzd[:, 0], # MINUTES
X_stdzd[:, 1], # UPLOADED
alpha=0.5,
label="Data (MINUTES vs UPLOADED)"
)
origin = np.array([[0, 0]])
pc1_dir = eig_vecs[[0,1], 0] * 3
pc2_dir = eig_vecs[[0,1], 1] * 3
plt.quiver(0, 0, pc1_dir[0], pc1_dir[1],
angles="xy", scale_units="xy", scale=1,
label="PC1")
plt.quiver(0, 0, pc2_dir[0], pc2_dir[1],
angles="xy", scale_units="xy", scale=1,
label="PC2")
plt.axhline(0)
plt.axvline(0)
plt.xlabel("MINUTES (std)")
plt.ylabel("PASSENGERS_ON (std)")
plt.title("Addresses of PC1 and PC2 in the original space")
plt.legend()
plt.tight_layout()
plt.show()
Varianza explicada por componente: PC1: 0.632673146927382 PC2: 0.33567944726256244 PC3: 0.031647405810055526