Week 03: KDE, K-Means & GMM
Libraries¶
- fastf1 - F1 timing and lap data
- numpy - Numerical arrays
- scipy.stats - Probability functions, KDE
- sklearn.cluster.KMeans - Unsupervised (hard) clustering
- sklearn.mixture.GaussianMixture - Probabilistic (soft) clustering
- seaborn, matplotlib.pyplot - Plotting
In [35]:
pip install fastf1
Requirement already satisfied: fastf1 in /opt/conda/lib/python3.13/site-packages (3.7.0)
(… dependency "Requirement already satisfied" lines trimmed …)
Note: you may need to restart the kernel to use updated packages.
DataFrame¶
Same as in Week 02
In [36]:
import fastf1
import pandas as pd
import numpy as np
from typing import Tuple
# Using Bahrain this time
def load_session(year: int = 2025, gp: str = "Bahrain", session_name: str = "R"):
    '''
    Load an F1 session using FastF1.

    Inputs
    ------
    year : int
        Championship year, e.g. 2025
    gp : str
        Grand Prix name, e.g. "Bahrain"
    session_name : str
        Session code: "FP1", "FP2", "FP3", "Q", "R"

    Output
    ------
    session : fastf1.core.Session
        A FastF1 session object with timing + lap data.
    '''
    # Cache makes reruns *much* faster after the first download.
    fastf1.Cache.enable_cache("fastf1_cache_dir")
    session = fastf1.get_session(year, gp, session_name)
    session.load()  # downloads timing data (first time only)
    return session

def build_model_table(session) -> pd.DataFrame:
    '''
    Turn FastF1 lap data into a clean ML table.

    We purposely do NOT use pick_quicklaps(). We keep *all* laps that have
    a valid LapTime + sector times + the required features.

    Inputs
    ------
    session : fastf1.core.Session

    Output
    ------
    df : pd.DataFrame
        Clean modeling table with:
        - TyreLife
        - Compound
        - Sector1TimeSeconds, Sector2TimeSeconds, Sector3TimeSeconds
        - TrackStatus
        - LapTimeSeconds (target)
    '''
    laps = session.laps.copy()  # all laps available

    # Convert time columns to seconds (float)
    def to_seconds(series):
        return series.dt.total_seconds()

    df = pd.DataFrame({
        "Driver": laps["Driver"],
        "TyreLife": laps["TyreLife"],
        "Compound": laps["Compound"],
        "TrackStatus": laps["TrackStatus"],
        "Sector1TimeSeconds": to_seconds(laps["Sector1Time"]),
        "Sector2TimeSeconds": to_seconds(laps["Sector2Time"]),
        "Sector3TimeSeconds": to_seconds(laps["Sector3Time"]),
        "LapTimeSeconds": to_seconds(laps["LapTime"]),
    })

    # Basic cleaning: keep rows where the model can actually learn
    df = df.dropna(subset=[
        "TyreLife", "Compound", "TrackStatus",
        "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds", "LapTimeSeconds"
    ]).reset_index(drop=True)

    # Keep only the 3 compounds asked for (some sessions have INTER/WET)
    df = df[df["Compound"].isin(["SOFT", "MEDIUM", "HARD"])].reset_index(drop=True)

    # TrackStatus is usually a string that *looks* like a number.
    # We keep it numeric so models can use it.
    df["TrackStatus"] = pd.to_numeric(df["TrackStatus"], errors="coerce")
    df = df.dropna(subset=["TrackStatus"]).reset_index(drop=True)
    df["TrackStatus"] = df["TrackStatus"].astype(int)

    return df
Load data¶
In [37]:
session = load_session(year=2025, gp="Bahrain", session_name="R")
df = build_model_table(session)
df.shape
core INFO Loading data for Bahrain Grand Prix - Race [v3.7.0]
req INFO Using cached data for session_info
req INFO Using cached data for driver_info
req INFO Using cached data for session_status_data
req INFO Using cached data for lap_count
req INFO Using cached data for track_status_data
req INFO Using cached data for _extended_timing_data
req INFO Using cached data for timing_app_data
core INFO Processing timing data...
req INFO Using cached data for car_data
req INFO Using cached data for position_data
req INFO Using cached data for weather_data
req INFO Using cached data for race_control_messages
core INFO Finished loading data for 20 drivers: ['81', '63', '4', '16', '44', '1', '10', '31', '22', '87', '12', '23', '6', '7', '14', '30', '18', '5', '55', '27']
Out[37]:
(1075, 8)
In [38]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
import numpy as np

# Use seaborn's default clean style for readability
sns.set_theme()

# 1. KDE using seaborn (all laps)
fig, ax = plt.subplots(figsize=(8, 4))

# seaborn.kdeplot:
# - estimates the probability density of lap times
# - smoother and more informative than a histogram
# - automatically chooses bandwidth (how smooth the curve is)
sns.kdeplot(
    data=df,
    x="LapTimeSeconds",
    ax=ax
)
ax.set_title("KDE of LapTimeSeconds (seaborn)")
ax.set_xlabel("seconds")

# Show the figure
plt.show()

# 2. KDE split by tyre compound
fig, ax = plt.subplots(figsize=(8, 4))

# Adding 'hue' overlays one KDE per compound
# This lets us compare distributions directly
sns.kdeplot(
    data=df,
    x="LapTimeSeconds",
    hue="Compound",
    ax=ax
)
ax.set_title("KDE of LapTimeSeconds by Compound")
ax.set_xlabel("seconds")
plt.show()

# # 3. KDE using scipy (manual approach)

# # Convert lap times to NumPy array
# # scipy works directly with numerical arrays
# vals = df["LapTimeSeconds"].to_numpy()

# # gaussian_kde:
# # - fits a smooth probability density function
# # - assumes the data is continuous
# # - gives more control than seaborn
# kde = gaussian_kde(vals)

# # Create evenly spaced x-values across the data range
# # This defines where the KDE will be evaluated
# x_grid = np.linspace(vals.min(), vals.max(), 300)

# # Evaluate the KDE at each x-value
# y_grid = kde(x_grid)

# fig, ax = plt.subplots(figsize=(8, 4))

# # Plot the manually computed KDE
# ax.plot(x_grid, y_grid, linewidth=2)
# ax.set_title("KDE of LapTimeSeconds (scipy gaussian_kde)")
# ax.set_xlabel("LapTimeSeconds (sec)")
# ax.set_ylabel("density")
# plt.show()
Explanation:¶
Plot 1: KDE of LapTimeSeconds (all laps together)¶
What is shown
- One big peak around ~97–100 seconds
- A long tail to the right (110–150+ seconds)

What that means
- Most laps are clustered around race pace
- A few laps are much slower: pit in/out laps, traffic, mistakes, tyre drop-off, etc.
- The distribution is right-skewed: many normal laps, plus a small number of slow outliers

Plot 2: KDE by Compound¶
What is shown
- SOFT: slightly faster peak, tighter spread
- MEDIUM: very sharp peak (most consistent)
- HARD: wider curve with heavier slow tail

What that means
- The compounds differ in both pace and consistency: softs peak slightly faster, mediums are the most repeatable, and hards carry more slow laps in the tail
- Each compound's distribution is still right-skewed, for the same reasons as Plot 1 (pit laps, traffic, tyre drop-off)

Plot 3: KDE using scipy's gaussian_kde¶
Looks similar to Plot 1 (same right-skewed shape). It's mainly for:
- learning what KDE actually does
- full customization
- exporting values
Additional Explanation¶
- A smooth curve representing the estimated probability density function (PDF) of lap times.
- Peaks in the curve correspond to lap times that occur more frequently.
- Valleys indicate rare lap times.
- Compared to a histogram:
  - KDE is continuous, not discrete bins
  - Easier to detect modes, skewness, and patterns
Example Interpretation:
- A single peak - most laps clustered around one lap time
- Two peaks - bimodal distribution (e.g., laps on worn vs fresh tyres)
- Long tail - occasional unusually fast or slow laps
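Since seaborn picks the bandwidth automatically, it is worth checking how sensitive the curve's shape is to that choice. Below is a minimal sketch (assuming the `df` built above) that overlays a few values of `bw_adjust`, seaborn's multiplier on its automatic bandwidth:

import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: how bandwidth changes the KDE shape.
# bw_adjust < 1 -> wigglier curve, bw_adjust > 1 -> smoother curve
fig, ax = plt.subplots(figsize=(8, 4))
for bw in [0.3, 1.0, 3.0]:
    sns.kdeplot(data=df, x="LapTimeSeconds", bw_adjust=bw,
                ax=ax, label=f"bw_adjust={bw}")
ax.set_title("KDE sensitivity to bandwidth")
ax.set_xlabel("seconds")
ax.legend()
plt.show()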
K-Means¶
K-means partitions laps into K clusters by assigning each lap to its nearest cluster centroid (Euclidean distance) and moving the centroids until the assignments stop changing. The choice of K is up to us; a quick elbow check is sketched below.
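A minimal sketch of that elbow check, assuming the `df` built above (the preprocessing mirrors the pipeline defined in the next cell; `prep_demo`, `X_demo`, and `num_cols_demo` are illustrative names):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sketch: plot K-means inertia (within-cluster sum of squared
# distances) for several K and look for the "bend".
num_cols_demo = ["TyreLife", "Sector1TimeSeconds", "Sector2TimeSeconds",
                 "Sector3TimeSeconds", "TrackStatus"]
prep_demo = ColumnTransformer([
    ("num", StandardScaler(), num_cols_demo),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Compound"]),
])
X_demo = prep_demo.fit_transform(df[num_cols_demo + ["Compound"]])

ks = range(2, 9)
inertias = [KMeans(n_clusters=k, random_state=7, n_init="auto")
            .fit(X_demo).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("inertia")
plt.title("Elbow check for K-means")
plt.show()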
In [39]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import numpy as np

# Pick the input features
feature_cols = [
    "TyreLife",
    "Compound",
    "Sector1TimeSeconds",
    "Sector2TimeSeconds",
    "Sector3TimeSeconds",
    "TrackStatus"
]
X = df[feature_cols].copy()

# Split columns by type
# Numeric columns: scale them so each contributes fairly to distance calculations
num_cols = ["TyreLife", "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds", "TrackStatus"]
# Categorical column: one-hot encode to turn "SOFT/MEDIUM/HARD" into numeric columns
cat_cols = ["Compound"]

# Preprocess: scale + encode
prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

# Cluster with KMeans
k = 4
kmeans = KMeans(n_clusters=k, random_state=7, n_init="auto")

# Pipeline keeps preprocessing + clustering together (same steps every time)
pipe = Pipeline([
    ("prep", prep),
    ("kmeans", kmeans)
])

# Fit the pipeline (prep happens, then kmeans runs on the transformed data)
pipe.fit(X)

# Cluster label for each row (used in the plots below)
labels = pipe.named_steps["kmeans"].labels_
Out[39]:
Pipeline(steps=[('prep',
ColumnTransformer(transformers=[('num', StandardScaler(),
['TyreLife',
'Sector1TimeSeconds',
'Sector2TimeSeconds',
'Sector3TimeSeconds',
'TrackStatus']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['Compound'])])),
('kmeans', KMeans(n_clusters=4, random_state=7))])
In [40]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(7, 5))
for lab in np.unique(labels):
    m = labels == lab
    ax.scatter(
        df.loc[m, "TyreLife"],
        df.loc[m, "LapTimeSeconds"],
        s=10,
        alpha=0.6,
        label=f"Cluster {lab}"
    )
ax.set_title("K-Means clusters (TyreLife vs LapTime)")
ax.set_xlabel("Tyre Life (laps)")
ax.set_ylabel("Lap Time (seconds)")
ax.legend()
plt.show()
Explanation¶
- Each dot = one lap
- X-axis (Tyre Life) = how many laps the tyre has done
- Y-axis (Lap Time) = how long the lap took (seconds)
- Color = the K-means cluster label (0–3)

Blue
- normal race pace
- clean laps

Red
- slow laps

Orange
- very slow - could be pit laps, safety car, etc.

Green
- slow but consistent
- TrackStatus conditions

Interpretation:¶
K-means separates the different lap modes for us. Instead of staring at averages, we can say:
- “Cluster 0 = clean laps”
- “Cluster 1 = very slow laps”
- “Cluster 3 = disrupted laps”
Inside each cluster¶
In [41]:
tmp = df.copy()
tmp["cluster"] = labels
tmp.groupby("cluster")[[
    "LapTimeSeconds", "TyreLife",
    "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds",
    "TrackStatus"
]].mean().sort_values("LapTimeSeconds")
Out[41]:
| cluster | LapTimeSeconds | TyreLife | Sector1TimeSeconds | Sector2TimeSeconds | Sector3TimeSeconds | TrackStatus |
|---|---|---|---|---|---|---|
| 0 | 98.644061 | 12.085193 | 31.351135 | 42.958952 | 24.333974 | 1.018256 |
| 2 | 118.853733 | 1.266667 | 51.969333 | 42.589933 | 24.294467 | 1.200000 |
| 3 | 125.864884 | 8.651163 | 36.836977 | 54.948698 | 34.079209 | 12.093023 |
| 1 | 138.337187 | 5.000000 | 39.860937 | 59.581000 | 38.895250 | 41.000000 |
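One caveat: cluster IDs are arbitrary (a different random seed can permute them). A small sketch, assuming the `tmp` table from the cell above, that assigns readable names by ordering clusters from fastest to slowest mean lap time (the label strings here are just illustrative):

# Sketch: map arbitrary cluster IDs to readable names,
# ordered from fastest to slowest mean lap time.
order = tmp.groupby("cluster")["LapTimeSeconds"].mean().sort_values().index
names = ["clean pace", "slow (fresh tyres)", "disrupted", "very slow"]  # illustrative
name_map = dict(zip(order, names))
tmp["cluster_name"] = tmp["cluster"].map(name_map)
tmp["cluster_name"].value_counts()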
Gaussian Mixture Model (GMM)¶
- Assigns probabilities (soft clustering)
- Clusters can be elliptical and have different spreads
- Can detect “borderline” laps using uncertainty
Preparing Data for Probabilistic Modeling¶
F1 lap times are not produced by a single process: clean laps, traffic laps, pit in/out laps, and safety-car laps each behave differently. A Gaussian Mixture Model suits this setting because it assumes the data comes from multiple overlapping distributions.
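Formally, a GMM with $K$ components models the density as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

where $\pi_k$, $\mu_k$, and $\Sigma_k$ are the mixing weight, mean, and covariance of component $k$. scikit-learn fits these parameters with the EM (expectation-maximization) algorithm.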
In [42]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.mixture import GaussianMixture

# Define the GMM model
# GaussianMixture:
# - clusters data by fitting multiple Gaussian "clouds"
# - unlike KMeans, it can model:
#   * elliptical clusters (not just circular)
#   * clusters of different sizes/spreads
#   * uncertainty (probabilities) for each cluster
gmm_k = 4
gmm = GaussianMixture(
    n_components=gmm_k,
    covariance_type="full",  # "full" lets each cluster have its own ellipse shape
    random_state=7
)

# Build a pipeline
# We still need preprocessing:
# - scale numeric columns (distance/probability math works better)
# - one-hot encode Compound (so it becomes numeric)
#
# Then fit GMM in that full transformed feature space.
gmm_pipe = Pipeline([
    ("prep", prep),
    ("gmm", gmm)
])
Fitting a Gaussian Mixture Model (GMM)¶
In [43]:
# fit() trains the preprocessing and the GMM.
gmm_pipe.fit(X)
# Transform X the SAME way GMM saw it (scaled + encoded)
X_prepped = gmm_pipe.named_steps["prep"].transform(X)
Get cluster probabilities and labels¶
In [44]:
# predict_proba gives, for each lap:
# - probability it belongs to each cluster
probs = gmm_pipe.named_steps["gmm"].predict_proba(X_prepped)
# Hard label = cluster with highest probability
gmm_labels = probs.argmax(axis=1)
# Uncertainty:
# - if max probability is close to 1 → confident assignment
# - if max probability is low → model is unsure (point sits between clusters)
uncertainty = 1 - probs.max(axis=1)
# Quick checks
probs.shape, uncertainty[:5]
Out[44]:
((1075, 4),
array([1.50134660e-09, 6.92601532e-10, 6.43350262e-10, 3.82465615e-10,
1.93276506e-10]))
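One practical use of these probabilities: list the laps the model is least sure about. A minimal sketch, assuming `df`, `gmm_labels`, and `uncertainty` from above:

# Sketch: surface the most "borderline" laps, i.e. those whose
# highest cluster probability is lowest.
borderline = df.copy()
borderline["gmm_cluster"] = gmm_labels
borderline["uncertainty"] = uncertainty
borderline.sort_values("uncertainty", ascending=False).head(10)[
    ["Driver", "Compound", "TyreLife", "LapTimeSeconds", "gmm_cluster", "uncertainty"]
]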
Visualize GMM Components¶
In [45]:
fig, ax = plt.subplots(figsize=(7, 5))
for lab in np.unique(gmm_labels):
    m = gmm_labels == lab
    ax.scatter(
        df.loc[m, "TyreLife"],
        df.loc[m, "LapTimeSeconds"],
        s=10,
        alpha=0.6,
        label=f"cluster {lab}"
    )
ax.set_title("GMM clusters (TyreLife vs LapTime)")
ax.set_xlabel("Tyre Life (laps)")
ax.set_ylabel("Lap Time (seconds)")
ax.legend()
plt.show()
Explanation¶
- The K-means plot and the GMM plot look very similar (I had to check, and it turns out that happens a lot, and I mean a lot)
- Where GMM goes further than K-means is the uncertainty: each lap gets a probability for every cluster, so we can flag borderline laps, which plain K-means cannot do (at least from what I've read, but I could be wrong)
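One concrete payoff of GMM being a probabilistic model: we can compare different component counts with a likelihood-based criterion such as BIC (lower is better), which K-means inertia does not offer directly. A minimal sketch, assuming `X_prepped` from above:

# Sketch: compare GMM component counts with BIC
# (Bayesian Information Criterion); lower BIC is better.
from sklearn.mixture import GaussianMixture

# ColumnTransformer output may be sparse; GaussianMixture needs dense
X_dense = X_prepped.toarray() if hasattr(X_prepped, "toarray") else X_prepped
for k in range(2, 7):
    g = GaussianMixture(n_components=k, covariance_type="full",
                        random_state=7).fit(X_dense)
    print(f"components={k}  BIC={g.bic(X_dense):.1f}")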