Week 3: Density Estimation¶
December 4, 2025
In previous classes I was lost but managed to figure things out; in this one I feel completely lost. Neil already warned us about that at the beginning of the class. So, with the few notes I took in class, I'll ask my friend ChatGPT to help me complete this assignment.
The first example is:
- Expectation-Maximization (E-M): Seemingly circular logic. Use the model to update the latent (hidden) variables; use the latent variables to update the model. Iteration maximizes the likelihood. Momentum or random restarts can help with local optima.
Another example Neil discussed is k-means:
K-means: you can choose the number of clusters by plotting the total distance from the points to their cluster centers as a function of the number of clusters and looking for an "elbow".
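To make sure I understood the elbow idea, here is a small sketch (my own addition, assuming scikit-learn is installed; X here is placeholder random data, not the Renfe dataset):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X = np.random.randn(200, 2)  # placeholder data; any (N, d) array works
inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # total squared distance to the nearest center
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total within-cluster distance (inertia)")
plt.show()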
Kernel Density Estimation
- Gaussian mixture models: Closely related to kernel density estimation (KDE is the limiting case with one Gaussian kernel on every data point). Soft E-M clustering with Gaussian distributions. Clusters with low mixture weight can be pruned.
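Since the heading is kernel density estimation, here is a tiny KDE sketch I added for myself (made-up 1D data; scipy's gaussian_kde picks a bandwidth with Scott's rule by default):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Placeholder sample: two bumps, so the density is clearly bimodal
data = np.concatenate([np.random.normal(-2, 0.5, 100),
                       np.random.normal(3, 1.0, 100)])
kde = gaussian_kde(data)  # one Gaussian kernel per data point
xs = np.linspace(data.min() - 1, data.max() + 1, 300)
plt.hist(data, bins=30, density=True, alpha=0.5)
plt.plot(xs, kde(xs))
plt.show()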
December 7, 2025
Expectation-Maximization (E-M)¶
It is used when we have:
- A model with hidden (latent) variables
- Example: to which "peak hour group" each data point belongs
The algorithm iterates:
🔹 E-step (Expectation) With the current model, we estimate:
- The probability that each data point belongs to each Gaussian distribution
These are the latent variables.
🔹 M-step (Maximization) With these probabilities:
- We update the model (means, variances, and weights)
Each iteration:
✅ Increases the likelihood ❌ May get stuck in a local optimum ✅ Can be improved with momentum, random restarts, etc.
This is the "seemingly circular logic":
- the model improves the hidden variables, and the hidden variables improve the model.
Prompt: Create a practical example of Density Estimation with Expectation–Maximization (EM) applied to your travelers by time slot in Barcelona.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# ============================================
# 1. Load and prepare data
# ============================================
df = pd.read_csv(
"datasets/barcelona_viajeros_por_franja_csv.csv",
encoding="latin-1",
sep=";"
)
# We're keeping only the core of Barcelona
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()
    h, m = map(int, inicio.split(":"))
    return h * 60 + m
df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)
# Aggregate travelers by time slot
df_agg = df.groupby("MINUTOS")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum()
df_agg["VIAJEROS_TOTALES"] = df_agg["VIAJEROS_SUBIDOS"] + df_agg["VIAJEROS_BAJADOS"]
# Variable we are going to model (one dimension): total travelers per slot
X = df_agg["VIAJEROS_TOTALES"].values.astype(float)
# We normalize (very important for numerical stability)
X = (X - X.mean()) / X.std()
N = X.shape[0]
# ============================================
# 2. Initialize GMM (2 Gaussians)
# ============================================
K = 2
np.random.seed(0)
mu = np.random.randn(K)  # initial means
sigma = np.ones(K)       # initial standard deviations
pi = np.ones(K) / K      # initial mixture weights
# ============================================
# 3. Auxiliary functions
# ============================================
def gaussian(x, mu, sigma):
    """Density of a normal N(mu, sigma^2) evaluated at x."""
    return (1.0 / (np.sqrt(2*np.pi) * sigma)) * np.exp(-0.5 * ((x - mu) / sigma)**2)
def log_likelihood(X, mu, sigma, pi):
    ll = 0.0
    for x in X:
        comp = 0.0
        for k in range(K):
            comp += pi[k] * gaussian(x, mu[k], sigma[k])
        ll += np.log(comp + 1e-12)
    return ll
# ============================================
# 4. EM Algorithm
# ============================================
num_iters = 100
loglik = []
for it in range(num_iters):
    # ---------- E-step ----------
    gamma = np.zeros((N, K))  # responsibilities
    for i, x in enumerate(X):
        for k in range(K):
            gamma[i, k] = pi[k] * gaussian(x, mu[k], sigma[k])
        gamma[i, :] /= gamma[i, :].sum()
    Nk = gamma.sum(axis=0)
    # ---------- M-step ----------
    for k in range(K):
        mu[k] = np.sum(gamma[:, k] * X) / Nk[k]
        sigma[k] = np.sqrt(
            np.sum(gamma[:, k] * (X - mu[k])**2) / Nk[k] + 1e-12
        )
        pi[k] = Nk[k] / N
    ll = log_likelihood(X, mu, sigma, pi)
    loglik.append(ll)
# ============================================
# 5. Density Estimation: histogram + estimated density
# ============================================
x_plot = np.linspace(X.min(), X.max(), 300)
pdf = np.zeros_like(x_plot)
for k in range(K):
pdf += pi[k] * gaussian(x_plot, mu[k], sigma[k])
plt.figure(figsize=(10, 5))
plt.hist(X, bins=15, density=True, alpha=0.6, label="Actual (normalized) data")
plt.plot(x_plot, pdf, label="Estimated density (GMM + EM)")
plt.xlabel("Total travelers (normalized)")
plt.ylabel("Density")
plt.title("Density Estimation con EM (Gaussian Mixture, K=2)")
plt.legend()
plt.tight_layout()
plt.show()
# ============================================
# 6. Convergence: log-likelihood
# ============================================
plt.figure(figsize=(6, 4))
plt.plot(loglik)
plt.xlabel("Iteration")
plt.ylabel("Log-likelihood")
plt.title("EM Convergence")
plt.tight_layout()
plt.show()
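A quick check I added on my own after the loop: print the learned parameters, mapping the normalized means back to traveler counts (recomputed from df_agg, since X was normalized in place):
# My addition: inspect the learned parameters, denormalized back to traveler counts
vals = df_agg["VIAJEROS_TOTALES"].values.astype(float)
orig_mean, orig_std = vals.mean(), vals.std()
for k in range(K):
    print(f"Component {k}: weight={pi[k]:.2f}, "
          f"mean={mu[k] * orig_std + orig_mean:.0f} travelers, "
          f"std={sigma[k] * orig_std:.0f}")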
Kernel Density Estimation¶
Gaussian mixture models¶
Prompt: Can you describe a little bit how the Gaussian Mixture Model works?
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the data is generated from a mixture of several Gaussian (normal) distributions. Each Gaussian represents a hidden subgroup (cluster) inside the data.
Instead of assigning each data point to exactly one cluster (as in k-means), a GMM performs soft clustering: each point has a probability of belonging to each Gaussian component.
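To make the "soft" part concrete, here is a tiny sketch I added (the weights, means, and standard deviations are made-up numbers, not from the dataset): the responsibility of each component for a point x is its weighted density divided by the total.
import numpy as np
from scipy.stats import norm
# Two made-up Gaussian components: weights pi_k, means mu_k, std devs sigma_k
pi = np.array([0.6, 0.4])
mu = np.array([0.0, 4.0])
sigma = np.array([1.0, 1.5])
x = 1.5
# gamma_k = pi_k * N(x | mu_k, sigma_k^2) / sum_j pi_j * N(x | mu_j, sigma_j^2)
weighted = pi * norm.pdf(x, loc=mu, scale=sigma)
gamma = weighted / weighted.sum()
print(gamma)  # probabilities sum to 1; this point belongs mostly to the first Gaussian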
# =========================================================
# 0. IMPORTS
# =========================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # for 3D histograms
# If you don't have scikit-learn, run the following first:
# !pip install scikit-learn
from sklearn.mixture import GaussianMixture
# =========================================================
# 1. LOAD AND PREPARE DATA
# =========================================================
df = pd.read_csv(
"datasets/barcelona_viajeros_por_franja_csv.csv",
encoding="latin-1",
sep=";"
)
# We're keeping only the core of Barcelona
df = df[df["NUCLEO_CERCANIAS"] == "BARCELONA"].copy()
# Convert TRAMO_HORARIO to minutes since midnight
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()
    h, m = map(int, inicio.split(":"))
    return h * 60 + m
df["MINUTOS"] = df["TRAMO_HORARIO"].apply(time_to_min)
# Aggregate by time slot (the whole core)
df_agg = df.groupby("MINUTOS")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum()
df_agg["VIAJEROS_TOTALES"] = df_agg["VIAJEROS_SUBIDOS"] + df_agg["VIAJEROS_BAJADOS"]
df_agg = df_agg.reset_index().sort_values("MINUTOS")
# Variables for the model
# We use 2D: [MINUTES, TOTAL_TRAVELERS]
X = df_agg[["MINUTOS", "VIAJEROS_TOTALES"]].values.astype(float)
# Normalize (helps the GMM numerically)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0) + 1e-8
X_norm = (X - X_mean) / X_std
# =========================================================
# 2. FIT A GAUSSIAN MIXTURE MODEL (GMM) WITH EM
# =========================================================
n_components = 3 # For example: morning / noon / afternoon
gmm = GaussianMixture(
n_components=n_components,
covariance_type="full",
random_state=0
)
gmm.fit(X_norm) # EM soft clustering
# Weights of each component (to prune unlikely clusters)
weights = gmm.weights_ # size K
print("Pesos de mezcla de los clusters:", weights)
# Pruning: we keep clusters whose weight > threshold
weight_threshold = 0.05
mask_clusters = weights > weight_threshold
idx_clusters_validos = np.where(mask_clusters)[0]
print("Clusters válidos (no podados):", idx_clusters_validos)
# Membership probabilities (soft EM clustering)
responsabilidades = gmm.predict_proba(X_norm)  # shape (N, K)
# Map the argmax over the kept columns back to the original cluster indices
cluster_suave = idx_clusters_validos[responsabilidades[:, idx_clusters_validos].argmax(axis=1)]
# =========================================================
# 3. 2D HISTOGRAM (MINUTES vs TOTAL TRAVELERS)
# =========================================================
plt.figure(figsize=(8, 6))
plt.hist2d(
df_agg["MINUTOS"],
df_agg["VIAJEROS_TOTALES"],
bins=15
)
plt.colorbar(label="Frequency")
plt.xlabel("Minutes since midnight")
plt.ylabel("Total travelers")
plt.title("2D Histogram – MINUTES vs TOTAL TRAVELERS")
plt.tight_layout()
plt.show()
# =========================================================
# 4. 3D HISTOGRAM (MINUTES vs TOTAL TRAVELERS)
# =========================================================
# We use a 2D histogram and extrude it into 3D bars
x = df_agg["MINUTOS"].values
y = df_agg["VIAJEROS_TOTALES"].values
# Define the bin grid
xbins = 10
ybins = 10
H, xedges, yedges = np.histogram2d(x, y, bins=[xbins, ybins])
# We construct the coordinates of the bars
xpos, ypos = np.meshgrid(
(xedges[:-1] + xedges[1:]) / 2,
(yedges[:-1] + yedges[1:]) / 2,
indexing="ij"
)
xpos = xpos.ravel()
ypos = ypos.ravel()
zpos = np.zeros_like(xpos)
dx = (xedges[1] - xedges[0]) * np.ones_like(zpos)
dy = (yedges[1] - yedges[0]) * np.ones_like(zpos)
dz = H.ravel()
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection="3d")
ax.bar3d(xpos, ypos, zpos, dx, dy, dz)
ax.set_xlabel("Minutes since midnight")
ax.set_ylabel("Total travelers")
ax.set_zlabel("Frequency")
ax.set_title("3D Histogram – MINUTES vs TOTAL TRAVELERS")
plt.tight_layout()
plt.show()
# =========================================================
# 5. VISUALIZE GMM CLUSTERS IN 2D
# =========================================================
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
X[:, 0], # MINUTES
X[:, 1], # TOTAL_TRAVELERS
c=cluster_suave,
cmap="viridis"
)
plt.xlabel("Minutes since midnight")
plt.ylabel("Total travelers")
plt.title("Gaussian Mixture Model – soft clustering (valid clusters)")
plt.colorbar(scatter, label="Cluster")
plt.tight_layout()
plt.show()
Cluster mixture weights: [0.23206772 0.4925547 0.27537758] Valid (unpruned) clusters: [0 1 2]
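Something I might try next (my own sketch, reusing X_norm from the code above): instead of fixing n_components=3 and pruning afterwards, let BIC pick the number of components (lower BIC is better):
# My addition: choose the number of GMM components by BIC
import numpy as np
from sklearn.mixture import GaussianMixture
ks = range(1, 7)
bics = [GaussianMixture(n_components=k, covariance_type="full",
                        random_state=0).fit(X_norm).bic(X_norm)
        for k in ks]
best_k = ks[int(np.argmin(bics))]
print("BIC per k:", bics, "-> best k:", best_k)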