Week 03: KDE, K-Means & GMM
Libraries¶
- fastf1 - F1 timing and lap data
- numpy - Numerical arrays
- scipy.stats - Probability functions, KDE
- sklearn.cluster.KMeans - Unsupervised (hard) clustering
- sklearn.mixture.GaussianMixture - Probabilistic (soft) clustering
- seaborn, matplotlib.pyplot - Plotting
In [35]:
pip install fastf1
Requirement already satisfied: fastf1 in /opt/conda/lib/python3.13/site-packages (3.7.0)
(… dependency "Requirement already satisfied" lines trimmed …)
Note: you may need to restart the kernel to use updated packages.
DataFrame¶
Same as in Week 02
In [36]:
import fastf1
import pandas as pd
import numpy as np
from typing import Tuple
# Using Bahrain this time
def load_session(year: int = 2025, gp: str = "Bahrain", session_name: str = "R"):
    '''
    Load an F1 session using FastF1.

    Inputs
    ------
    year : int
        Championship year, e.g. 2025
    gp : str
        Grand Prix name, e.g. "Bahrain"
    session_name : str
        Session code: "FP1", "FP2", "FP3", "Q", "R"

    Output
    ------
    session : fastf1.core.Session
        A FastF1 session object with timing + lap data.
    '''
    # Cache makes reruns *much* faster after the first download.
    fastf1.Cache.enable_cache("fastf1_cache_dir")
    session = fastf1.get_session(year, gp, session_name)
    session.load()  # downloads timing data (first time only)
    return session

def build_model_table(session) -> pd.DataFrame:
    '''
    Turn FastF1 lap data into a clean ML table.

    We purposely do NOT use pick_quicklaps(). We keep *all* laps that have
    a valid LapTime + sector times + the required features.

    Inputs
    ------
    session : fastf1.core.Session

    Output
    ------
    df : pd.DataFrame
        Clean modeling table with:
        - TyreLife
        - Compound
        - Sector1TimeSeconds, Sector2TimeSeconds, Sector3TimeSeconds
        - TrackStatus
        - LapTimeSeconds (target)
    '''
    laps = session.laps.copy()  # all laps available

    # Convert time columns to seconds (float)
    def to_seconds(series):
        return series.dt.total_seconds()

    df = pd.DataFrame({
        "Driver": laps["Driver"],
        "TyreLife": laps["TyreLife"],
        "Compound": laps["Compound"],
        "TrackStatus": laps["TrackStatus"],
        "Sector1TimeSeconds": to_seconds(laps["Sector1Time"]),
        "Sector2TimeSeconds": to_seconds(laps["Sector2Time"]),
        "Sector3TimeSeconds": to_seconds(laps["Sector3Time"]),
        "LapTimeSeconds": to_seconds(laps["LapTime"]),
    })

    # Basic cleaning: keep rows where the model can actually learn
    df = df.dropna(subset=[
        "TyreLife", "Compound", "TrackStatus",
        "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds", "LapTimeSeconds"
    ]).reset_index(drop=True)

    # Keep only the 3 compounds asked for (some sessions have INTER/WET)
    df = df[df["Compound"].isin(["SOFT", "MEDIUM", "HARD"])].reset_index(drop=True)

    # TrackStatus is usually a string that *looks* like a number.
    # We keep it numeric so models can use it.
    df["TrackStatus"] = pd.to_numeric(df["TrackStatus"], errors="coerce")
    df = df.dropna(subset=["TrackStatus"]).reset_index(drop=True)
    df["TrackStatus"] = df["TrackStatus"].astype(int)

    return df
Load data¶
In [37]:
session = load_session(year=2025, gp="Bahrain", session_name="R")
df = build_model_table(session)
df.shape
core INFO Loading data for Bahrain Grand Prix - Race [v3.7.0]
req INFO Using cached data for session_info
req INFO Using cached data for driver_info
req INFO Using cached data for session_status_data
req INFO Using cached data for lap_count
req INFO Using cached data for track_status_data
req INFO Using cached data for _extended_timing_data
req INFO Using cached data for timing_app_data
core INFO Processing timing data...
req INFO Using cached data for car_data
req INFO Using cached data for position_data
req INFO Using cached data for weather_data
req INFO Using cached data for race_control_messages
core INFO Finished loading data for 20 drivers: ['81', '63', '4', '16', '44', '1', '10', '31', '22', '87', '12', '23', '6', '7', '14', '30', '18', '5', '55', '27']
Out[37]:
(1075, 8)
In [38]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
import numpy as np

# Use seaborn's default clean style for readability
sns.set_theme()

# 1. KDE using seaborn (all laps)
fig, ax = plt.subplots(figsize=(8, 4))

# seaborn.kdeplot:
# - estimates the probability density of lap times
# - smoother and more informative than a histogram
# - automatically chooses bandwidth (how smooth the curve is)
sns.kdeplot(
    data=df,
    x="LapTimeSeconds",
    ax=ax
)
ax.set_title("KDE of LapTimeSeconds (seaborn)")
ax.set_xlabel("seconds")

# Show the figure
plt.show()

# 2. KDE split by tyre compound
fig, ax = plt.subplots(figsize=(8, 4))

# Adding 'hue' overlays one KDE per compound
# This lets us compare distributions directly
sns.kdeplot(
    data=df,
    x="LapTimeSeconds",
    hue="Compound",
    ax=ax
)
ax.set_title("KDE of LapTimeSeconds by Compound")
ax.set_xlabel("seconds")
plt.show()

# # 3. KDE using scipy (manual approach)

# # Convert lap times to NumPy array
# # scipy works directly with numerical arrays
# vals = df["LapTimeSeconds"].to_numpy()

# # gaussian_kde:
# # - fits a smooth probability density function
# # - assumes the data is continuous
# # - gives more control than seaborn
# kde = gaussian_kde(vals)

# # Create evenly spaced x-values across the data range
# # This defines where the KDE will be evaluated
# x_grid = np.linspace(vals.min(), vals.max(), 300)

# # Evaluate the KDE at each x-value
# y_grid = kde(x_grid)

# fig, ax = plt.subplots(figsize=(8, 4))

# # Plot the manually computed KDE
# ax.plot(x_grid, y_grid, linewidth=2)
# ax.set_title("KDE of LapTimeSeconds (scipy gaussian_kde)")
# ax.set_xlabel("LapTimeSeconds (sec)")
# ax.set_ylabel("density")
# plt.show()
Explanation:¶
Plot 1: KDE of LapTimeSeconds (all laps together)¶
What is shown
- One big peak around ~97–100 seconds
- A long tail to the right (110–150+ seconds)

What that means
- Most laps are clustered around race pace
- A few laps are much slower: pit in/out laps, traffic, mistakes, tyre drop-off, etc.
- The distribution is right-skewed: many normal laps, plus a small number of slow outliers

Plot 2: KDE by Compound¶
What is shown
- SOFT: slightly faster peak, tighter spread
- MEDIUM: very sharp peak (most consistent)
- HARD: wider curve with heavier slow tail

What that means
- The compounds differ in both pace and consistency: softs peak slightly faster, mediums are the most repeatable, and hards carry more slow laps in the tail
- Each compound's distribution is still right-skewed, for the same reasons as Plot 1 (pit laps, traffic, tyre drop-off)

Plot 3: KDE using scipy's gaussian_kde¶
Looks similar to Plot 1 (same right-skewed shape). It's mainly for:
- learning what KDE actually does
- full customization
- exporting values
Additional Explanation¶
- A smooth curve representing the estimated probability density function (PDF) of lap times.
- Peaks in the curve correspond to lap times that occur more frequently.
- Valleys indicate rare lap times.
- Compared to a histogram:
  - KDE is continuous, not discrete bins
  - Easier to detect modes, skewness, and patterns
Example Interpretation:
- A single peak - most laps clustered around one lap time
- Two peaks - bimodal distribution (e.g., laps on worn vs fresh tyres)
- Long tail - occasional unusually fast or slow laps
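Since seaborn picks the bandwidth automatically, it is worth checking how sensitive the curve's shape is to that choice. Below is a minimal sketch (assuming the `df` built above) that overlays a few values of `bw_adjust`, seaborn's multiplier on its automatic bandwidth:

import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: how bandwidth changes the KDE shape.
# bw_adjust < 1 -> wigglier curve, bw_adjust > 1 -> smoother curve
fig, ax = plt.subplots(figsize=(8, 4))
for bw in [0.3, 1.0, 3.0]:
    sns.kdeplot(data=df, x="LapTimeSeconds", bw_adjust=bw,
                ax=ax, label=f"bw_adjust={bw}")
ax.set_title("KDE sensitivity to bandwidth")
ax.set_xlabel("seconds")
ax.legend()
plt.show()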
K-Means¶
K-means partitions laps into K clusters by assigning each lap to its nearest cluster centroid (Euclidean distance) and moving the centroids until the assignments stop changing. The choice of K is up to us; a quick elbow check is sketched below.
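A minimal sketch of that elbow check, assuming the `df` built above (the preprocessing mirrors the pipeline defined in the next cell; `prep_demo`, `X_demo`, and `num_cols_demo` are illustrative names):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sketch: plot K-means inertia (within-cluster sum of squared
# distances) for several K and look for the "bend".
num_cols_demo = ["TyreLife", "Sector1TimeSeconds", "Sector2TimeSeconds",
                 "Sector3TimeSeconds", "TrackStatus"]
prep_demo = ColumnTransformer([
    ("num", StandardScaler(), num_cols_demo),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Compound"]),
])
X_demo = prep_demo.fit_transform(df[num_cols_demo + ["Compound"]])

ks = range(2, 9)
inertias = [KMeans(n_clusters=k, random_state=7, n_init="auto")
            .fit(X_demo).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("inertia")
plt.title("Elbow check for K-means")
plt.show()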
In [39]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import numpy as np

# Pick the input features
feature_cols = [
    "TyreLife",
    "Compound",
    "Sector1TimeSeconds",
    "Sector2TimeSeconds",
    "Sector3TimeSeconds",
    "TrackStatus"
]
X = df[feature_cols].copy()

# Split columns by type
# Numeric columns: scale them so each contributes fairly to distance calculations
num_cols = ["TyreLife", "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds", "TrackStatus"]
# Categorical column: one-hot encode to turn "SOFT/MEDIUM/HARD" into numeric columns
cat_cols = ["Compound"]

# Preprocess: scale + encode
prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

# Cluster with KMeans
k = 4
kmeans = KMeans(n_clusters=k, random_state=7, n_init="auto")

# Pipeline keeps preprocessing + clustering together (same steps every time)
pipe = Pipeline([
    ("prep", prep),
    ("kmeans", kmeans)
])

# Fit the pipeline (prep happens, then kmeans runs on the transformed data)
pipe.fit(X)

# Cluster label for each row (used in the plots below)
labels = pipe.named_steps["kmeans"].labels_
Out[39]:
Pipeline(steps=[('prep',
ColumnTransformer(transformers=[('num', StandardScaler(),
['TyreLife',
'Sector1TimeSeconds',
'Sector2TimeSeconds',
'Sector3TimeSeconds',
'TrackStatus']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['Compound'])])),
('kmeans', KMeans(n_clusters=4, random_state=7))])
In [40]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(7, 5))
for lab in np.unique(labels):
    m = labels == lab
    ax.scatter(
        df.loc[m, "TyreLife"],
        df.loc[m, "LapTimeSeconds"],
        s=10,
        alpha=0.6,
        label=f"Cluster {lab}"
    )
ax.set_title("K-Means clusters (TyreLife vs LapTime)")
ax.set_xlabel("Tyre Life (laps)")
ax.set_ylabel("Lap Time (seconds)")
ax.legend()
plt.show()
Explanation¶
- Each dot = one lap
- X-axis (Tyre Life) = how many laps the tyre has done
- Y-axis (Lap Time) = how long the lap took (seconds)
- Color = the K-means cluster label (0–3)

Blue
- normal race pace
- clean laps

Red
- slow laps

Orange
- very slow - could be pit laps, safety car, etc.

Green
- slow but consistent
- TrackStatus conditions

Interpretation:¶
K-means separates the different lap modes for us. Instead of staring at averages, we can say:
- “Cluster 0 = clean laps”
- “Cluster 1 = very slow laps”
- “Cluster 3 = disrupted laps”
Inside each cluster¶
In [41]:
tmp = df.copy()
tmp["cluster"] = labels
tmp.groupby("cluster")[[
    "LapTimeSeconds", "TyreLife",
    "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds",
    "TrackStatus"
]].mean().sort_values("LapTimeSeconds")
Out[41]:
| cluster | LapTimeSeconds | TyreLife | Sector1TimeSeconds | Sector2TimeSeconds | Sector3TimeSeconds | TrackStatus |
|---|---|---|---|---|---|---|
| 0 | 98.644061 | 12.085193 | 31.351135 | 42.958952 | 24.333974 | 1.018256 |
| 2 | 118.853733 | 1.266667 | 51.969333 | 42.589933 | 24.294467 | 1.200000 |
| 3 | 125.864884 | 8.651163 | 36.836977 | 54.948698 | 34.079209 | 12.093023 |
| 1 | 138.337187 | 5.000000 | 39.860937 | 59.581000 | 38.895250 | 41.000000 |
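One caveat: cluster IDs are arbitrary (a different random seed can permute them). A small sketch, assuming the `tmp` table from the cell above, that assigns readable names by ordering clusters from fastest to slowest mean lap time (the label strings here are just illustrative):

# Sketch: map arbitrary cluster IDs to readable names,
# ordered from fastest to slowest mean lap time.
order = tmp.groupby("cluster")["LapTimeSeconds"].mean().sort_values().index
names = ["clean pace", "slow (fresh tyres)", "disrupted", "very slow"]  # illustrative
name_map = dict(zip(order, names))
tmp["cluster_name"] = tmp["cluster"].map(name_map)
tmp["cluster_name"].value_counts()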
Gaussian Mixture Model (GMM)¶
- Assigns probabilities (soft clustering)
- Clusters can be elliptical and have different spreads
- Can detect “borderline” laps using uncertainty
Preparing Data for Probabilistic Modeling¶
F1 lap times are not produced by a single process: clean laps, traffic laps, pit in/out laps, and safety-car laps each behave differently. A Gaussian Mixture Model suits this setting because it assumes the data comes from multiple overlapping distributions.
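Formally, a GMM with $K$ components models the density as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

where $\pi_k$, $\mu_k$, and $\Sigma_k$ are the mixing weight, mean, and covariance of component $k$. scikit-learn fits these parameters with the EM (expectation-maximization) algorithm.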
In [42]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.mixture import GaussianMixture

# Define the GMM model
# GaussianMixture:
# - clusters data by fitting multiple Gaussian "clouds"
# - unlike KMeans, it can model:
#   * elliptical clusters (not just circular)
#   * clusters of different sizes/spreads
#   * uncertainty (probabilities) for each cluster
gmm_k = 4
gmm = GaussianMixture(
    n_components=gmm_k,
    covariance_type="full",  # "full" lets each cluster have its own ellipse shape
    random_state=7
)

# Build a pipeline
# We still need preprocessing:
# - scale numeric columns (distance/probability math works better)
# - one-hot encode Compound (so it becomes numeric)
#
# Then fit GMM in that full transformed feature space.
gmm_pipe = Pipeline([
    ("prep", prep),
    ("gmm", gmm)
])
Fitting a Gaussian Mixture Model (GMM)¶
In [43]:
# fit() trains the preprocessing and the GMM.
gmm_pipe.fit(X)
# Transform X the SAME way GMM saw it (scaled + encoded)
X_prepped = gmm_pipe.named_steps["prep"].transform(X)
Get cluster probabilities and labels¶
In [44]:
# predict_proba gives, for each lap:
# - probability it belongs to each cluster
probs = gmm_pipe.named_steps["gmm"].predict_proba(X_prepped)
# Hard label = cluster with highest probability
gmm_labels = probs.argmax(axis=1)
# Uncertainty:
# - if max probability is close to 1 → confident assignment
# - if max probability is low → model is unsure (point sits between clusters)
uncertainty = 1 - probs.max(axis=1)
# Quick checks
probs.shape, uncertainty[:5]
Out[44]:
((1075, 4),
array([1.50134660e-09, 6.92601532e-10, 6.43350262e-10, 3.82465615e-10,
1.93276506e-10]))
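One practical use of these probabilities: list the laps the model is least sure about. A minimal sketch, assuming `df`, `gmm_labels`, and `uncertainty` from above:

# Sketch: surface the most "borderline" laps, i.e. those whose
# highest cluster probability is lowest.
borderline = df.copy()
borderline["gmm_cluster"] = gmm_labels
borderline["uncertainty"] = uncertainty
borderline.sort_values("uncertainty", ascending=False).head(10)[
    ["Driver", "Compound", "TyreLife", "LapTimeSeconds", "gmm_cluster", "uncertainty"]
]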
Visualize GMM Components¶
In [45]:
fig, ax = plt.subplots(figsize=(7, 5))
for lab in np.unique(gmm_labels):
    m = gmm_labels == lab
    ax.scatter(
        df.loc[m, "TyreLife"],
        df.loc[m, "LapTimeSeconds"],
        s=10,
        alpha=0.6,
        label=f"cluster {lab}"
    )
ax.set_title("GMM clusters (TyreLife vs LapTime)")
ax.set_xlabel("Tyre Life (laps)")
ax.set_ylabel("Lap Time (seconds)")
ax.legend()
plt.show()
Explanation¶
- The K-means plot and the GMM plot look very similar (I had to check, and it turns out that happens a lot, and I mean a lot)
- Where GMM goes further than K-means is the uncertainty: each lap gets a probability for every cluster, so we can flag borderline laps, which plain K-means cannot do (at least from what I've read, but I could be wrong)
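One concrete payoff of GMM being a probabilistic model: we can compare different component counts with a likelihood-based criterion such as BIC (lower is better), which K-means inertia does not offer directly. A minimal sketch, assuming `X_prepped` from above:

# Sketch: compare GMM component counts with BIC
# (Bayesian Information Criterion); lower BIC is better.
from sklearn.mixture import GaussianMixture

# ColumnTransformer output may be sparse; GaussianMixture needs dense
X_dense = X_prepped.toarray() if hasattr(X_prepped, "toarray") else X_prepped
for k in range(2, 7):
    g = GaussianMixture(n_components=k, covariance_type="full",
                        random_state=7).fit(X_dense)
    print(f"components={k}  BIC={g.bic(X_dense):.1f}")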