Class 6: Density estimation
The EM algorithm is a standard way to find the optimal parameters of a model when there are hidden variables (ones we cannot observe), or when we don't know which group each data point belongs to.
For example, in the recorded temperatures of my home town, León, we can see that there are two groups of data: those belonging to the winter season and those belonging to the summer season.
There is also some data in between, without extreme ups and downs, covering the rest of the year. So I'm expecting two extreme groups of data: winter and summer.
A Gaussian Mixture Model (GMM) assumes that the data come from:
$$P(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k)$$
where $K$ is the number of components, $\pi_k$ is the weight of each component, and $\mu_k$, $\sigma_k$ are the mean and standard deviation of component $k$.
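To make the formula concrete, evaluating this density at a point is just a weighted sum of the component densities. Here is a minimal sketch with two components; the parameter values are illustrative only, not fitted to anything:

import numpy as np
from scipy.stats import norm

# Illustrative parameters for a two-component mixture (made-up numbers)
pi = [0.4, 0.6]        # component weights, must sum to 1
mu = [8.0, 16.5]       # component means
sigma = [4.6, 4.9]     # component standard deviations

x0 = 5.0  # evaluate P(x) at 5 °C
p_x0 = sum(pi[k] * norm.pdf(x0, mu[k], sigma[k]) for k in range(2))
print(f"P({x0}) = {p_x0:.4f}")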
EM works in two phases:
- Expectation step (E): compute the probability that each data point belongs to each Gaussian component. For example, a temperature of 5 °C is more likely to belong to the winter season than to the summer one.
- Maximization step (M): use these probabilities to recompute the mean, the weight, and the standard deviation of each component.
This process is repeated until the parameters converge (a minimal hand-written sketch of the loop follows).
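Here is that loop for a 1-D two-component mixture, written straight from the update equations. It is a sketch only, with no convergence check or numerical safeguards, and the name em_gmm_1d is mine; later we use scikit-learn's GaussianMixture, which does the same thing more robustly:

import numpy as np

def em_gmm_1d(x, n_iter=50):
    # Initialize the two components from the data quartiles
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.array([
            pi[k] / (sigma[k] * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
            for k in range(2)
        ])                                # shape (2, n)
        resp = dens / dens.sum(axis=0)    # normalize over components
        # M-step: re-estimate weights, means, and standard deviations
        nk = resp.sum(axis=1)
        pi = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return pi, mu, sigma

# Quick usage on synthetic data from two "seasons"
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5, 3, 200), rng.normal(20, 4, 200)])
print(em_gmm_1d(x))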
Working with temperature data
First we are going to connect to the API of the Spanish meteorological agency (AEMET) and store the temperature data (five months only):
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# (Optional) try to import scipy in order to fit a normal distribution later
try:
    from scipy.stats import norm
    HAS_SCIPY = True
except ImportError:
    HAS_SCIPY = False
    print("⚠️ SciPy is not installed. The normal-distribution fit will be skipped.")

# ==============================
# 1. CONFIGURATION
# ==============================
API_KEY = "YOUR_AEMET_API_KEY"  # 🔴 CHANGE THIS to your AEMET API key (placeholder shown here)

# León - Virgen del Camino station (commonly used as the reference for León)
station_id = "2661"

# Date range to analyze
start_date = "2024-08-01T00:00:00UTC"
end_date = "2024-12-31T23:59:59UTC"

# ==============================
# 2. DOWNLOAD DATA FROM AEMET
# ==============================
base_url = (
    "https://opendata.aemet.es/opendata/api/"
    "valores/climatologicos/diarios/datos/"
    f"fechaini/{start_date}/fechafin/{end_date}/estacion/{station_id}"
)
headers = {"api_key": API_KEY}

print("Calling the AEMET API...")
resp_meta = requests.get(base_url, headers=headers)
if resp_meta.status_code != 200:
    raise RuntimeError(f"AEMET request error (meta): {resp_meta.status_code} - {resp_meta.text}")

meta_json = resp_meta.json()
print("Meta response:", meta_json.get("descripcion", "OK"))

data_url = meta_json.get("datos")
if not data_url:
    raise RuntimeError("No data URL found in the AEMET response.")

print("Downloading the actual data from:", data_url)
resp_data = requests.get(data_url)
if resp_data.status_code != 200:
    raise RuntimeError(f"Error downloading the data: {resp_data.status_code} - {resp_data.text}")

# AEMET returns JSON (a list of dicts)
data_json = resp_data.json()

# ==============================
# 3. BUILD A DATAFRAME AND CLEAN IT
# ==============================
df = pd.DataFrame(data_json)

# Convert numeric columns to float (tmed, tmax, tmin usually arrive as strings with decimal commas)
for col in ["tmed", "tmax", "tmin"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col].str.replace(",", "."), errors="coerce")

# Convert the date column to datetime
df["fecha"] = pd.to_datetime(df["fecha"], errors="coerce")

print("\nFirst rows of the data:")
print(df.head())
print("\nDataFrame info:")
print(df.info())
Calling the AEMET API...
Meta response: exito
Downloading the actual data from: https://opendata.aemet.es/opendata/sh/a944e043
First rows of the data:
fecha indicativo nombre provincia altitud tmed prec \
0 2024-08-01 2661 LEÓN, VIRGEN DEL CAMINO LEON 916 24.2 0,0
1 2024-08-02 2661 LEÓN, VIRGEN DEL CAMINO LEON 916 22.2 0,0
2 2024-08-03 2661 LEÓN, VIRGEN DEL CAMINO LEON 916 21.5 0,0
3 2024-08-04 2661 LEÓN, VIRGEN DEL CAMINO LEON 916 22.7 0,0
4 2024-08-05 2661 LEÓN, VIRGEN DEL CAMINO LEON 916 23.7 0,0
tmin horatmin tmax ... sol presMax horaPresMax presMin horaPresMin \
0 16.5 05:30 31.8 ... 12,4 914,0 00 911,0 16
1 14.9 03:33 29.6 ... 13,2 915,6 Varias 911,6 04
2 11.9 03:55 31.1 ... 13,9 915,6 00 912,3 18
3 13.4 05:42 32.0 ... 13,7 915,1 Varias 911,6 18
4 14.9 04:42 32.5 ... 13,7 913,0 00 909,3 16
hrMedia hrMax horaHrMax hrMin horaHrMin
0 44 78 05:54 22 15:44
1 52 85 Varias 28 16:51
2 38 93 Varias 20 16:32
3 31 74 Varias 19 17:30
4 34 76 Varias 16 12:15
[5 rows x 25 columns]
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fecha 153 non-null datetime64[ns]
1 indicativo 153 non-null object
2 nombre 153 non-null object
3 provincia 153 non-null object
4 altitud 153 non-null object
5 tmed 153 non-null float64
6 prec 153 non-null object
7 tmin 153 non-null float64
8 horatmin 152 non-null object
9 tmax 153 non-null float64
10 horatmax 152 non-null object
11 dir 151 non-null object
12 velmedia 152 non-null object
13 racha 151 non-null object
14 horaracha 151 non-null object
15 sol 152 non-null object
16 presMax 152 non-null object
17 horaPresMax 152 non-null object
18 presMin 152 non-null object
19 horaPresMin 152 non-null object
20 hrMedia 152 non-null object
21 hrMax 152 non-null object
22 horaHrMax 152 non-null object
23 hrMin 152 non-null object
24 horaHrMin 152 non-null object
dtypes: datetime64[ns](1), float64(3), object(21)
memory usage: 30.0+ KB
None
Now that the data are in the df, let's follow the scikit-learn example and apply EM to this data:
from sklearn.mixture import GaussianMixture
import numpy as np

t = df["tmed"].dropna().values.reshape(-1, 1)

# Fit two Gaussians (winter and summer)
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(t)

print("Means:", gmm.means_.flatten())
# The square root of the covariances gives the standard deviations
print("Std devs:", np.sqrt(gmm.covariances_).flatten())
print("Weights:", gmm.weights_)
Means: [ 8.00986681 16.45978074]
Std devs: [4.61955284 4.90995054]
Weights: [0.42483819 0.57516181]
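As a quick sanity check of the E-step intuition (reusing the gmm fitted above), we can ask for each component's responsibility for a 5 °C day; the colder component should dominate:

# Posterior probability (responsibility) of each component for one temperature
print(gmm.predict_proba(np.array([[5.0]])))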
Another try: weight of newborns
Let's try density estimation again with an example dataset I found on the Kaggle website: https://www.kaggle.com/competitions/prediction-interval-competition-i-birth-weight/data
In this competition there is a file called test that you can download; inside it there is a column named "DBWT" that holds the birth weights of the newborns.
The file I downloaded is called birth_data.csv and it is stored in my datasets folder.
Now let's do some density estimation on it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# ================================
# 1. LOAD THE DATASET
# ================================
df = pd.read_csv("datasets/birth_data.csv")

# Make sure DBWT is numeric (in case there are odd values)
df["DBWT"] = pd.to_numeric(df["DBWT"], errors="coerce")

# Drop NaNs
bw = df["DBWT"].dropna().values.reshape(-1, 1)

print(f"Number of valid weights: {len(bw)}")
print("First values:", bw[:10].flatten())

# ================================
# 2. KDE - non-parametric estimation
# ================================
plt.figure(figsize=(10,5))
sns.kdeplot(bw.flatten(), fill=True, label="KDE (non-parametric estimate)", linewidth=2)
plt.title("Density of birth weight (DBWT)")
plt.xlabel("Weight (g)")
plt.grid(True)
plt.legend()
plt.show()

# ================================
# 3. NORMAL FIT
# ================================
mu, sigma = norm.fit(bw.flatten())
x = np.linspace(bw.min(), bw.max(), 300)
pdf_normal = norm.pdf(x, mu, sigma)

plt.figure(figsize=(10,5))
plt.hist(bw, bins=30, density=True, alpha=0.5, label="Data (histogram)")
plt.plot(x, pdf_normal, "r-", lw=2, label=f"Fitted normal μ={mu:.1f}, σ={sigma:.1f}")
plt.title("Normal distribution fit to DBWT weight")
plt.xlabel("Weight (g)")
plt.grid(True)
plt.legend()
plt.show()

print(f"Fitted normal → mu={mu:.1f}, sigma={sigma:.1f}")

# ================================
# 4. EM - Gaussian mixture (GMM)
# ================================
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(bw)

means = gmm.means_.flatten()
stds = np.sqrt(gmm.covariances_.flatten())
weights = gmm.weights_.flatten()

print("\n=== GMM (EM) ===")
for i in range(2):
    print(f"Component {i+1}: μ={means[i]:.1f} g, σ={stds[i]:.1f} g, weight={weights[i]:.2f}")

# Total density estimated by the GMM
logprob = gmm.score_samples(x.reshape(-1,1))
pdf_gmm = np.exp(logprob)

plt.figure(figsize=(10,5))
plt.hist(bw, bins=30, density=True, alpha=0.5, label="Data (histogram)")
plt.plot(x, pdf_gmm, "k-", lw=2, label="GMM (full mixture)")

# Individual components
def normal_pdf(x, mean, std):
    return 1/(std*np.sqrt(2*np.pi)) * np.exp(-0.5 * ((x-mean)/std)**2)

for i in range(2):
    plt.plot(
        x,
        weights[i] * normal_pdf(x, means[i], stds[i]),
        "--",
        lw=2,
        label=f"Component {i+1}"
    )

plt.title("EM (GMM) fit to DBWT weight")
plt.xlabel("Weight (g)")
plt.grid(True)
plt.legend()
plt.show()
Number of valid weights: 108082
First values: [2800 1900 2960 3657 3742 3317 3550 515 2438 4015]
Fitted normal → mu=3260.1, sigma=589.5

=== GMM (EM) ===
Component 1: μ=3369.9 g, σ=400.3 g, weight=0.69
Component 2: μ=3018.8 g, σ=821.4 g, weight=0.31
Wow, what a beautiful Gaussian mixture! As we can see, the model identifies two components:
- Component 1: weights around 3.3 kg with a deviation of 400 g. These are the typical newborn weights, and they account for about 70% of the data.
- Component 2: the data that don't fit the first component. Its mean (about 3.0 kg) is still very close to component 1's; I think that's because a lot of the remaining data sit near the typical weight but are much more spread out. In the plot, this component is shifted slightly to the left, which says there are more babies below the average weight (a quick model-selection check follows this list).
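Since the two components overlap so much, it is fair to ask whether two components are really better than one. This check is not part of the original analysis, but a common habit is to compare the BIC of mixtures with different numbers of components (reusing bw from above; lower BIC is better):

from sklearn.mixture import GaussianMixture

# Compare mixtures with 1 to 3 components by BIC (lower is better)
for k in range(1, 4):
    g = GaussianMixture(n_components=k, random_state=42).fit(bw)
    print(f"K={k}: BIC={g.bic(bw):.1f}")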
Now I want to try this again, but adding another variable: the sex of the baby. The dataset has a column called SEX. Let's see what estimate we get for each sex:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# ============================
# 1. LOAD DATA
# ============================
df = pd.read_csv("datasets/birth_data.csv")

# Ensure numeric types
df["DBWT"] = pd.to_numeric(df["DBWT"], errors="coerce")

# Don't touch SEX yet; it may be numeric or text
df = df.dropna(subset=["DBWT", "SEX"])

print("Unique SEX values in the file:", df["SEX"].unique())

# ============================
# 2. FUNCTION TO ANALYZE ONE GROUP
# ============================
def analyze_group(data, label):
    data = data.dropna().values.reshape(-1, 1)
    if len(data) < 10:
        print(f"\n[WARNING] Group '{label}' has very few data points ({len(data)}). Skipping it.")
        return

    print(f"\n===== Analysis for group: {label} (n={len(data)}) =====")

    # KDE
    plt.figure(figsize=(10,5))
    sns.kdeplot(data.flatten(), fill=True, linewidth=2)
    plt.title(f"KDE – Birth weight ({label})")
    plt.xlabel("Weight (g)")
    plt.grid(True)
    plt.show()

    # Normal fit
    mu, sigma = norm.fit(data.flatten())
    x = np.linspace(data.min(), data.max(), 300)
    pdf_normal = norm.pdf(x, mu, sigma)

    plt.figure(figsize=(10,5))
    plt.hist(data, bins=30, density=True, alpha=0.4, label="Data")
    plt.plot(x, pdf_normal, "r-", lw=2, label=f"Normal μ={mu:.1f}, σ={sigma:.1f}")
    plt.title(f"Normal fit – Weight ({label})")
    plt.xlabel("Weight (g)")
    plt.grid(True)
    plt.legend()
    plt.show()

    print(f"Normal ({label}): μ={mu:.1f}, σ={sigma:.1f}")

    # GMM with 2 components
    gmm = GaussianMixture(n_components=2, random_state=42)
    gmm.fit(data)

    means = gmm.means_.flatten()
    stds = np.sqrt(gmm.covariances_.flatten())
    weights = gmm.weights_.flatten()

    print(f"\n=== GMM (EM) – {label} ===")
    for i in range(2):
        print(f"Component {i+1}: μ={means[i]:.1f} g, σ={stds[i]:.1f} g, weight={weights[i]:.2f}")

    # Total GMM density
    logprob = gmm.score_samples(x.reshape(-1,1))
    pdf_gmm = np.exp(logprob)

    plt.figure(figsize=(10,5))
    plt.hist(data, bins=30, density=True, alpha=0.4, label="Data")
    plt.plot(x, pdf_gmm, "k-", lw=2, label="GMM – full mixture")

    # Individual components
    def normal_pdf(x, mean, std):
        return 1/(std*np.sqrt(2*np.pi)) * np.exp(-0.5 * ((x-mean)/std)**2)

    for i in range(2):
        plt.plot(
            x,
            weights[i] * normal_pdf(x, means[i], stds[i]),
            "--",
            lw=2,
            label=f"Component {i+1}"
        )

    plt.title(f"GMM (EM) – Birth weight ({label})")
    plt.xlabel("Weight (g)")
    plt.grid(True)
    plt.legend()
    plt.show()

# ============================
# 3. GROUP AUTOMATICALLY BY SEX
# ============================
for sex_value, group in df.groupby("SEX"):
    label = f"SEX = {sex_value}"
    analyze_group(group["DBWT"], label)
Unique SEX values in the file: ['F' 'M']

===== Analysis for group: SEX = F (n=52756) =====
Normal (SEX = F): μ=3201.8, σ=574.9
=== GMM (EM) – SEX = F ===
Component 1: μ=3305.3 g, σ=391.0 g, weight=0.70
Component 2: μ=2962.9 g, σ=811.7 g, weight=0.30

===== Analysis for group: SEX = M (n=55326) =====
Normal (SEX = M): μ=3315.7, σ=597.9
=== GMM (EM) – SEX = M ===
Component 1: μ=3052.9 g, σ=838.7 g, weight=0.30
Component 2: μ=3430.0 g, σ=404.9 g, weight=0.70
As we see in the results, the standard deviation for males (598 g) is a little higher than for females (575 g). Maybe this is because male weights are more dispersed: there is a wider range of weights than among the females.
The results also say that males weigh about 114 g more than females on average.
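That +114 g is just the difference between the two fitted normal means (3315.7 − 3201.8 ≈ 113.9 g). We can confirm it directly from the raw data with a one-line check (reusing df from above):

# Average birth weight per SEX group
print(df.groupby("SEX")["DBWT"].mean())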
Now, trying to understand the code better, I cleaned it up and fit the distribution to a single normal:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.read_csv("datasets/birth_data.csv")
df["DBWT"] = pd.to_numeric(df["DBWT"], errors="coerce")
bw = df["DBWT"].dropna()
mu, sigma = norm.fit(bw)
print(f"Normal ajustada: μ={mu:.2f}, σ={sigma:.2f}")
xmin, xmax = bw.min(), bw.max()
x = np.linspace(xmin, xmax, 300)
pdf = norm.pdf(x, mu, sigma)
plt.hist(bw, bins=30, density=True, alpha=0.5, label="Data")
plt.plot(x, pdf, "r-", lw=2, label=f"Normal fitted μ={mu:.1f}, σ={sigma:.1f}")
plt.title("Normal distribution fitted to weight")
plt.xlabel("Weight (g)")
plt.legend()
plt.grid(True)
plt.show()
Fitted normal: μ=3260.11, σ=589.55
There is a little skew on the left (we know it's because there are more data at lower weights than at higher ones).
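We can put a number on that left tail with the sample skewness (a quick check reusing bw from above; a negative value indicates a heavier left tail):

from scipy.stats import skew

print(f"Skewness of DBWT: {skew(bw):.3f}")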
So let's try to fit a gamma distribution instead:
from scipy.stats import gamma

# Fit a three-parameter gamma (shape, loc, scale) by maximum likelihood
shape, loc, scale = gamma.fit(bw)
print("Gamma parameters:", shape, loc, scale)

pdf = gamma.pdf(x, shape, loc=loc, scale=scale)
plt.hist(bw, bins=30, density=True, alpha=0.5)
plt.plot(x, pdf, "m-", label="Gamma fitted")
plt.legend()
plt.grid(True)
plt.show()
Gamma parameters: 411.3949066569892 -9083.014801092791 29.981664624790938
There is still a little skew to the left.
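To compare the normal and gamma fits with a number instead of by eye, one option is the total log-likelihood of the data under each model (a minimal sketch reusing bw, mu, sigma, shape, loc, and scale from above; higher is better):

# Total log-likelihood of the data under each fitted model
ll_normal = norm.logpdf(bw, mu, sigma).sum()
ll_gamma = gamma.logpdf(bw, shape, loc=loc, scale=scale).sum()
print(f"Normal log-likelihood: {ll_normal:.1f}")
print(f"Gamma  log-likelihood: {ll_gamma:.1f}")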