Week 02: FastF1 Visual Exploration & Modeling
Most of the code comes from my interaction with ChatGPT (I edited and added where I could, though not so much in the Linear Regression, Cross Validation and Prediction parts).
- Visual exploration
- linear/quadratic fits
- Linear Regression
- cross-validation
- predicted vs actual
What you’ll need¶
- Python 3.9 or newer
pip install fastf1 pandas numpy matplotlib seaborn scikit-learn plotly scipy
Data¶
Features
- Tyre Life
- Tyre Compound (SOFT, MEDIUM, HARD)
- Sector Times (seconds)
- Track Status
Output
- Lap Time (seconds)
Note:¶
We use all available lap times (we do not call pick_quicklaps()).
Visuals¶
Goal: make the data feel real before we model it. Time to go Indiana Jones on the data set and find, "What's up?!"
We will:
- Load a session and build the model table
- Make lots of visuals (hist, scatter, pairplot, heatmap, Sankey)
- Fit a linear and quadratic function (just to see shape)
- Train a Linear Regression model with proper preprocessing
- Do cross-validation
- Plot Predicted vs Actual lap time
Getting the help of good old buddy, "ChatGPT":
"Create a subset of the FastF1 dataset for visual and function fitting using LapTimes, Compound(Tyre), Sector1Time, Sector2Time, Sector3Time, TyreLife and TrackStatus. Do not only use quick_laps and find all valid laps."
import os
import fastf1
import pandas as pd
import numpy as np
from typing import Tuple

def load_session(year: int = 2025, gp: str = "Monza", session_name: str = "R"):
    '''
    Load an F1 session using FastF1.
    Inputs
    ------
    year : int
        Championship year, e.g. 2024
    gp : str
        Grand Prix name, e.g. "Monza"
    session_name : str
        Session code: "FP1", "FP2", "FP3", "Q", "R"
    Output
    ------
    session : fastf1.core.Session
        A FastF1 session object with timing + lap data.
    '''
    # The following is from FastF1
    # Cache makes reruns *much* faster after the first download.
    # enable_cache expects the directory to exist, so create it first.
    os.makedirs("fastf1_cache_dir", exist_ok=True)
    fastf1.Cache.enable_cache("fastf1_cache_dir")
    session = fastf1.get_session(year, gp, session_name)
    session.load()  # downloads timing data (first time only)
    return session
def build_model_table(session) -> pd.DataFrame:
    '''
    Turn FastF1 lap data into a clean ML table.
    We purposely do NOT use pick_quicklaps(). We keep *all* laps that have
    a valid LapTime + sector times + the required features.
    Inputs
    ------
    session : fastf1.core.Session
    Output
    ------
    df : pd.DataFrame
        Clean modeling table with:
        - Driver (kept for reference, not used as a feature)
        - TyreLife
        - Compound
        - Sector1TimeSeconds, Sector2TimeSeconds, Sector3TimeSeconds
        - TrackStatus
        - LapTimeSeconds (target)
    '''
    laps = session.laps.copy()  # all laps available

    # Convert time columns to seconds (float)
    def to_seconds(series):
        return series.dt.total_seconds()

    df = pd.DataFrame({
        "Driver": laps["Driver"],
        "TyreLife": laps["TyreLife"],
        "Compound": laps["Compound"],
        "TrackStatus": laps["TrackStatus"],
        "Sector1TimeSeconds": to_seconds(laps["Sector1Time"]),
        "Sector2TimeSeconds": to_seconds(laps["Sector2Time"]),
        "Sector3TimeSeconds": to_seconds(laps["Sector3Time"]),
        "LapTimeSeconds": to_seconds(laps["LapTime"]),
    })

    # Basic cleaning: keep rows where the model can actually learn
    df = df.dropna(subset=[
        "TyreLife", "Compound", "TrackStatus",
        "Sector1TimeSeconds", "Sector2TimeSeconds", "Sector3TimeSeconds", "LapTimeSeconds"
    ]).reset_index(drop=True)

    # Keep only the 3 compounds asked for (some sessions have INTER/WET)
    df = df[df["Compound"].isin(["SOFT", "MEDIUM", "HARD"])].reset_index(drop=True)

    # TrackStatus is usually a string that *looks* like a number.
    # We keep it numeric so models can use it.
    df["TrackStatus"] = pd.to_numeric(df["TrackStatus"], errors="coerce")
    df = df.dropna(subset=["TrackStatus"]).reset_index(drop=True)
    df["TrackStatus"] = df["TrackStatus"].astype(int)
    return df
Explanation:¶
Getting data
def load_session(year, gp, session_name):
Cache setup for faster access
fastf1.Cache.enable_cache("fastf1_cache_dir")
Load Session
session = fastf1.get_session(year, gp, session_name)
session.load()
Extracts only the required data from FastF1 and returns a clean DataFrame for modeling
def build_model_table(session) -> pd.DataFrame:
Grab all laps
laps = session.laps.copy()
- session.laps - the raw lap table
- .copy() - avoids modifying FastF1's internal data, and it's just good practice
Time conversion (because FastF1 stores times as Timedelta)
def to_seconds(series):
    return series.dt.total_seconds()
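To see what the conversion does, here is a minimal sketch on made-up values (the two lap times and the missing entry are my own toy data, not pulled from the session). It only assumes pandas is installed.

```python
import pandas as pd

# Hypothetical lap times stored as Timedelta, the way FastF1 stores them.
# The third entry is missing on purpose.
lap_times = pd.Series(pd.to_timedelta(["00:01:24.859", "00:01:23.512", None]))

# .dt.total_seconds() turns each Timedelta into a float number of seconds.
# Missing values (NaT) become NaN, which dropna() can remove later.
seconds = lap_times.dt.total_seconds()
print(seconds.tolist())
```

A plain float column is what the plots and scikit-learn expect, which is why the table is built from seconds instead of Timedelta values.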
Building the clean DataFrame using pandas with only features required
df = pd.DataFrame({...})
Basic cleaning to drop incomplete or NaN rows
df = df.dropna(subset=[...]).reset_index(drop=True)
Only using "SOFT", "MEDIUM", "HARD" tyre types for this class. (Sadly, we also have to avoid the 'weather_table' because it's a lot of concepts to get through in a month.)
df = df[df["Compound"].isin(["SOFT", "MEDIUM", "HARD"])].reset_index(drop=True)
Converting TrackStatus to Numeric and dropping incomplete sets
df["TrackStatus"] = pd.to_numeric(df["TrackStatus"], errors="coerce")
df = df.dropna(subset=["TrackStatus"]).reset_index(drop=True)
df["TrackStatus"] = df["TrackStatus"].astype(int)
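A small sketch of what errors="coerce" does, using a toy Series I made up (the None and "n/a" entries are hypothetical bad values, not something I saw in the Monza data):

```python
import pandas as pd

# Hypothetical raw TrackStatus values: strings that look like numbers,
# plus two entries that cannot be converted.
status = pd.Series(["1", "12", "4", None, "n/a"])

# errors="coerce" replaces anything unparseable with NaN instead of raising.
numeric = pd.to_numeric(status, errors="coerce")

# Drop the NaN rows, then cast to int (int columns cannot hold NaN).
clean = numeric.dropna().astype(int)
print(clean.tolist())
```

The order matters: the cast to int only works after the NaN rows are gone, which is why the notebook runs dropna() between the two steps.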
Load the Data¶
session = load_session(year=2025, gp="Monza", session_name="R")
df = build_model_table(session)
df.head(), df.shape
core INFO Loading data for Italian Grand Prix - Race [v3.7.0]
req INFO Using cached data for session_info
req INFO Using cached data for driver_info
req INFO Using cached data for session_status_data
req INFO Using cached data for lap_count
req INFO Using cached data for track_status_data
req INFO Using cached data for _extended_timing_data
req INFO Using cached data for timing_app_data
core INFO Processing timing data...
req INFO Using cached data for car_data
req INFO Using cached data for position_data
req INFO Using cached data for weather_data
req INFO Using cached data for race_control_messages
core INFO Finished loading data for 20 drivers: ['1', '4', '81', '16', '63', '44', '23', '5', '12', '6', '55', '87', '22', '30', '31', '10', '43', '18', '14', '27']
(  Driver  TyreLife Compound  TrackStatus  Sector1TimeSeconds  \
 0    VER       2.0   MEDIUM            1              28.457
 1    VER       3.0   MEDIUM            1              27.212
 2    VER       4.0   MEDIUM            1              27.375
 3    VER       5.0   MEDIUM            1              27.520
 4    VER       6.0   MEDIUM            1              27.434

    Sector2TimeSeconds  Sector3TimeSeconds  LapTimeSeconds
 0              28.843              27.559          84.859
 1              28.713              27.587          83.512
 2              28.455              27.432          83.262
 3              28.427              27.641          83.588
 4              28.496              27.646          83.576  ,
 (955, 8))
What’s inside df?¶
- TyreLife: how old the tyre is (in laps)
- Compound: SOFT / MEDIUM / HARD
- Sector1/2/3TimeSeconds: seconds spent in each sector
- TrackStatus: a code for what’s happening on track (yellow flags, safety car, etc.)
- LapTimeSeconds: the total lap time in seconds (our target)
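A hedged sketch for reading TrackStatus. As far as I understand the FastF1 documentation, the per-lap value is a string of every status code seen during that lap, so 124 would mean "clear, then yellow, then safety car" and not a number that is bigger than 4. The code table and the helper name decode_track_status are my assumptions, so check them against your FastF1 version.

```python
# Code meanings as I understand them from the FastF1 docs (assumption:
# verify against the version you have installed).
STATUS_MEANING = {
    "1": "track clear",
    "2": "yellow flag",
    "4": "safety car",
    "5": "red flag",
    "6": "virtual safety car deployed",
    "7": "virtual safety car ending",
}

def decode_track_status(code) -> list:
    """Hypothetical helper: split a per-lap code such as 124 into its parts."""
    return [STATUS_MEANING.get(digit, "unknown") for digit in str(code)]

print(decode_track_status(1))
print(decode_track_status(124))
```

If that reading is right, the integer version of TrackStatus is more of a label than a measurement, which is worth remembering when it goes through StandardScaler later.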
Quick sanity checks (Recommended in the ChatGPT: First steps to analyzing cleaned data sets)¶
In other words: "Before I trust this data or train a model, let me make sure nothing in the new data set is out of the ordinary or crazy." We are not doing deep analysis yet, just checking.
Why: If lap times have weird spikes, it is important to take note before modeling.
Input: df
Output: histograms
import matplotlib.pyplot as plt
import seaborn as sns # better design template
sns.set_theme() # using seaborn theme
fig, ax = plt.subplots(figsize=(8,4))
ax.hist(df["LapTimeSeconds"], bins=40)
ax.set_title("LapTimeSeconds distribution")
ax.set_xlabel("seconds")
ax.set_ylabel("count")
plt.show()
fig, ax = plt.subplots(figsize=(8,4))
ax.hist(df["TyreLife"], bins=30)
ax.set_title("TyreLife distribution")
ax.set_xlabel("laps")
ax.set_ylabel("count")
plt.show()
Explanation:¶
Create figure with axis:
fig, ax = plt.subplots(figsize=(8,4))
Draw Histogram and split into 40 bins (each bar = how many laps fall in this time range)
ax.hist(df["LapTimeSeconds"], bins=40)
Labeling and Rendering Plot
ax.set_title("LapTimeSeconds distribution")
ax.set_xlabel("seconds")
ax.set_ylabel("count")
plt.show()
Histogram: LapTimeSecondsDistribution¶
- Peak around 82-85 seconds is the normal race pace
- Bars out at 100-105 seconds could be due to Safety Car Laps, VSC laps, etc
Histogram: TyreLifeDistribution¶
- Lots of laps at low TyreLife (0–10)
- Gradual drop-off as TyreLife increases
- Long tail reaching 40–50 laps
Why is a sanity check important?
- There could be a lot of garbage data even after the initial cleaning
- A few outliers can dominate the statistics and any model fitted to the data
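One common way to put a number on "weird spikes" is the 1.5 × IQR rule. This is a sketch on made-up lap times (not the Monza data), and the rule itself is my addition as an illustration, not something the notebook uses.

```python
import numpy as np

# Made-up lap times in seconds: a normal race-pace cluster plus two slow laps.
lap_times = np.array([83.1, 83.4, 83.6, 83.9, 84.2, 84.5, 84.8, 102.3, 104.7])

# Interquartile range: the spread of the middle 50% of the laps.
q1, q3 = np.percentile(lap_times, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 * IQR above the third quartile is flagged.
upper = q3 + 1.5 * iqr
outliers = lap_times[lap_times > upper]
print(upper, outliers.tolist())
```

Flagging is not the same as deleting. In this notebook the slow laps are kept on purpose, so the point is only to know they are there before modeling.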
Scatter plots¶
Why: Scatter plots show patterns fast:
- Older tyres - often slower laps (not always)
- Sector times - strongly related to lap time (they literally add up)
Input: 2 columns
Output: relationship (trend/no trend, clusters, outliers)
fig, ax = plt.subplots(figsize=(7,5))
ax.scatter(df["TyreLife"], df["LapTimeSeconds"], s=10)
ax.set_title("TyreLife vs LapTimeSeconds")
ax.set_xlabel("TyreLife (laps)")
ax.set_ylabel("LapTimeSeconds (sec)")
plt.show()
Scatter Plot: TyreLife vs LapTimeSeconds¶
Visual Representation
- A dense horizontal cloud between ~82–85s
- A few high lap time outliers (100+ seconds), mostly at low TyreLife
Interpretation
- Lap times don't steadily increase with TyreLife - drivers manage tyres, fuel burns off, etc
- High lap time outliers - traffic, tyre warm-up phase, etc
fig, ax = plt.subplots(figsize=(7,5))
ax.scatter(df["Sector1TimeSeconds"], df["LapTimeSeconds"], s=10)
ax.set_title("Sector1TimeSeconds vs LapTimeSeconds")
ax.set_xlabel("Sector1TimeSeconds (sec)")
ax.set_ylabel("LapTimeSeconds (sec)")
plt.show()
Scatter Plot: Sector1TimeSeconds vs LapTimeSeconds¶
Visual Representation
- A strong correlation
Interpretation
- A faster Sector 1 time goes with a faster overall lap time, which is expected because Sector 1 is literally part of the lap time.
Pairplot¶
Quickly went over it in class
Why: This is like quickly going through all the steps of Exploratory Data Analysis.
Input: a few numeric columns
Output: a grid of scatter plots and histograms
small = df[["TyreLife","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","LapTimeSeconds"]].sample(min(2000, len(df)), random_state=7)
sns.pairplot(small, corner=True)
plt.show()
Correlation heatmap¶
Why: Correlation is a quick “are these moving together?” check. It’s not proof of causation (we’ll talk about that in Section 4).
Input: numeric columns
Output: heatmap (values from -1 to +1)
corr = df[["TyreLife","TrackStatus","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","LapTimeSeconds"]].corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax)
ax.set_title("Correlation heatmap")
plt.show()
Explanation: Correlation Heatmap¶
This plot shows how strongly each feature moves with every other feature, with values ranging from -1 to +1
- +1 = move together
- 0 = basically unrelated
- –1 = move opposite ways
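To make the number in each heatmap cell less magical, here is a sketch that computes the Pearson correlation by hand for the five rows shown by df.head() earlier. The helper name pearson_r is mine; five laps are far too few for a real conclusion, so treat this as mechanics only.

```python
import numpy as np

# Sector 1 times and lap times for the first five rows of df (from df.head()).
sector1 = np.array([28.457, 27.212, 27.375, 27.520, 27.434])
lap = np.array([84.859, 83.512, 83.262, 83.588, 83.576])

def pearson_r(a, b):
    """Hypothetical helper: Pearson correlation written out step by step."""
    a_centered = a - a.mean()          # distance of each value from its mean
    b_centered = b - b.mean()
    covariance_part = (a_centered * b_centered).sum()
    scale = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(covariance_part / scale)

r = pearson_r(sector1, lap)
print(round(r, 3))
print(round(float(np.corrcoef(sector1, lap)[0, 1]), 3))  # same value from NumPy
```

df.corr() does this for every pair of columns, and the heatmap just colours the resulting table.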
Boxplot: Compound vs Lap Time¶
Why: Tyre compound acts like “gear choices” — different behaviours. Boxplots show median + spread + outliers.
Input: Compound categories + LapTimeSeconds
Output: boxplot insight about compound speed/spread
fig, ax = plt.subplots(figsize=(7,5))
sns.boxplot(data=df, x="Compound", y="LapTimeSeconds", ax=ax)
ax.set_title("LapTimeSeconds by Compound")
plt.show()
Explanation: Boxplot¶
- Soft tyres = fastest: Lowest median lap time
- Medium tyres = consistent: Few extreme outliers
- Hard tyres = consistent but slowest on average
- The few outliers could be due to VSC, Safety Car, etc
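The numbers behind a boxplot can also be printed as a table. A minimal sketch with a toy DataFrame (all values are made up for illustration, including the one slow HARD lap):

```python
import pandas as pd

# Toy data: a few laps per compound, with one slow lap on HARD.
toy = pd.DataFrame({
    "Compound": ["SOFT", "SOFT", "MEDIUM", "MEDIUM", "HARD", "HARD", "HARD"],
    "LapTimeSeconds": [82.9, 83.1, 83.4, 83.6, 83.8, 84.0, 101.5],
})

# One row per compound: how many laps, the median, and the extremes.
summary = toy.groupby("Compound")["LapTimeSeconds"].agg(["count", "median", "min", "max"])
print(summary)
```

The median barely moves when a single slow lap is added, which is why the boxplot's centre line is a safer "typical pace" than the mean. Running the same groupby on df would give the real values behind the plot above.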
Fit a Line and Curve: Linear and Quadratic Function¶
It’s just to see:
- Is the relationship kind of linear?
- Or does it curve (like tyres degrading faster after some point)?
We’ll fit:
- Linear: y = a*x + b
- Quadratic: y = a*x^2 + b*x + c
Input: x = TyreLife, y = LapTimeSeconds
Output: fitted coefficients with plot overlay
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
x = df[["TyreLife"]].values
y = df["LapTimeSeconds"].values
lin = LinearRegression().fit(x, y)
poly = PolynomialFeatures(degree=2, include_bias=False)
x2 = poly.fit_transform(x)
quad = LinearRegression().fit(x2, y)
x_grid = np.linspace(df["TyreLife"].min(), df["TyreLife"].max(), 200).reshape(-1,1)
y_lin = lin.predict(x_grid)
y_quad = quad.predict(poly.transform(x_grid))
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7,5))
ax.scatter(df["TyreLife"], df["LapTimeSeconds"], s=8, alpha=0.35, label="data")
ax.plot(x_grid.ravel(), y_lin, linewidth=2, label="linear fit")
ax.plot(x_grid.ravel(), y_quad, linewidth=2, label="quadratic fit")
ax.set_title("TyreLife → LapTimeSeconds (simple fits)")
ax.set_xlabel("TyreLife (laps)")
ax.set_ylabel("LapTimeSeconds (sec)")
ax.legend()
plt.show()
lin.coef_, lin.intercept_, quad.coef_, quad.intercept_
(array([-0.06613968]), np.float64(85.25861865157495), array([-0.3076687 , 0.00564496]), np.float64(86.99328853297702))
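A quick worked example of what the quadratic coefficients say. With PolynomialFeatures(degree=2, include_bias=False) on one column, the coefficient order is [x, x^2], so the first number in quad.coef_ is b and the second is a. The sketch below copies the printed values and finds where the fitted curve stops going down and starts going up. This describes the shape of the fit only; I am not claiming it is the tyre "cliff", because fuel burn-off and slow laps are mixed into the same data.

```python
# Coefficients copied from the output above (quad.coef_ and quad.intercept_).
b, a = -0.3076687, 0.00564496   # order in coef_ is [TyreLife, TyreLife^2]
c = 86.99328853297702

def quad_fit(x):
    """The fitted curve y = a*x^2 + b*x + c."""
    return a * x ** 2 + b * x + c

# A parabola with a > 0 has its lowest point at x = -b / (2a).
turning_point = -b / (2 * a)
print(round(turning_point, 1))             # TyreLife (laps) at the minimum
print(round(quad_fit(turning_point), 2))   # fitted lap time (sec) at that point
```

So the quadratic fit falls until roughly 27 laps of tyre age and rises after that, while the linear fit has a slightly negative slope over the whole range.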
Linear Regression with preprocessing¶
We use:
- StandardScaler for numeric columns
- OneHotEncoder for Compound
- LinearRegression as the baseline model
Input: feature table X, target y
Output: trained pipeline + metrics
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
feature_cols = ["TyreLife","Compound","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","TrackStatus"]
target_col = "LapTimeSeconds"
X = df[feature_cols].copy()
y = df[target_col].copy()
num_cols = ["TyreLife","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","TrackStatus"]
cat_cols = ["Compound"]
preprocess = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
)
model = Pipeline(steps=[
("prep", preprocess),
("reg", LinearRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
model.fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
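mse = mean_squared_error(y_test, pred)  # mean squared error, in squared seconds
rmse = np.sqrt(mse)  # root mean squared error, back in seconds
r2 = r2_score(y_test, pred)
mae, mse, r2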
(8.77948092352794e-15, 1.2687857072455226e-28, 1.0)
Explanation (ChatGPT to the rescue):¶
Choose features and target
- Features = what the model uses to guess the answer (inputs)
- Target = what you want to predict (output)
- We will use tyre age, tyre type, sector times, and track status to predict lap time.
feature_cols = ["TyreLife","Compound","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","TrackStatus"]
target_col = "LapTimeSeconds"
Build X (inputs) and y (outputs)
- X = the spreadsheet without the answer column
- y = the answer column (lap time)
X = df[feature_cols].copy()
y = df[target_col].copy()
Note: .copy() just prevents accidental edits to the original data.
Splitting columns into number-type and category-type
- Numeric columns are numbers (can be scaled)
- Categorical columns are words/labels (need encoding)
num_cols = ["TyreLife","Sector1TimeSeconds","Sector2TimeSeconds","Sector3TimeSeconds","TrackStatus"]
cat_cols = ["Compound"]
Preprocessing Setup
preprocess = ColumnTransformer(transformers=[("num", StandardScaler(), num_cols), ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),])
- ColumnTransformer - sends different columns to different cleaners
- Name: "num" (just a label)
- Tool: StandardScaler()
For Compound (OneHotEncoder sorts the categories alphabetically):
- "HARD" - [1, 0, 0]
- "MEDIUM" - [0, 1, 0]
- "SOFT" - [0, 0, 1]
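A small sketch to check the encoding directly, assuming scikit-learn is installed. The four compound values are toy input. OneHotEncoder sorts the categories it finds, so the columns come out in the order HARD, MEDIUM, SOFT.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy column of compounds (one column, four rows).
compounds = [["SOFT"], ["MEDIUM"], ["HARD"], ["SOFT"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(compounds).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0].tolist())  # column order chosen by the encoder
print(encoded)

# A label the encoder never saw becomes all zeros instead of raising an error,
# because of handle_unknown="ignore".
unseen = encoder.transform([["INTERMEDIATE"]]).toarray()
print(unseen)
```

The all-zeros row is the reason handle_unknown="ignore" is a safe default here: a compound that only shows up in the test split will not crash predict().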
Build Pipeline
- prep = clean the data
- reg = run linear regression
Why Pipeline?
- can't forget preprocessing
- training and test use the same exact transformation
- avoids leaking test data into the preprocessing (the scaler is fitted on the training rows only)
model = Pipeline(steps=[("prep", preprocess),("reg", LinearRegression())])
Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
- 80% training
- 20% testing
- random_state=7 - same split every run
Train the model
model.fit(X_train, y_train)
pred = model.predict(X_test)
- fit = learn patterns from training data
- predict = guess lap times for the test data
Grade the model
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
- MAE (Mean Absolute Error): on average, how far off the predictions are
- The linear regression model is off by 8.77948092352794e-15 seconds, which is effectively zero (floating-point rounding noise)
- RMSE (Root Mean Squared Error): like MAE, but the errors are squared before averaging, so a few bad predictions raise it much more than many small mistakes do.
- R² (R-squared): how well the model explains what’s going on overall. Think of it as:
- 1.0 - perfect predictions
- 0.0 - no better than guessing the average
- Negative - worse than guessing
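The three grades are simple enough to compute by hand. Here is a sketch in plain Python on four made-up laps (the actual and predicted values are invented so that one prediction is badly off):

```python
import math

actual = [83.0, 84.0, 85.0, 100.0]     # made-up lap times (seconds)
predicted = [83.5, 83.5, 85.0, 96.0]   # made-up model guesses

errors = [p - a for p, a in zip(predicted, actual)]   # 0.5, -0.5, 0.0, -4.0

# MAE: average size of the error, ignoring the sign.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square the errors, average them, then take the square root.
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

# R^2: 1 minus (model's squared error / squared error of guessing the mean).
mean_actual = sum(actual) / len(actual)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 3), round(r2, 3))
```

MAE comes out at 1.25 seconds while RMSE is about 2.03 seconds. The gap between them is caused by the single 4-second miss, which is the "a few bad predictions" effect described above.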
Cross-validation¶
- Checks whether the model is reliable and performs well on a consistent basis.
- This helps us trust the results
# splits data into multiple chunks
# trains and tests the model multiple times automatically
from sklearn.model_selection import KFold, cross_val_score # tools for cross-validation
# create a K-Fold splitter:
# shuffle=True - randomize the data before splitting
cv = KFold(n_splits=5, shuffle=True, random_state=7)
# run cross-validation:
# model - full pipeline (preprocessing + regression)
# X, y - input features and target
# cv=cv - use the 5-fold splitter defined above
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
# convert negative MAE scores back to positive values
mae_scores = -scores
# output:
# all MAE scores from each fold
# the average MAE (typical error)
# the standard deviation (how consistent the model is)
mae_scores, mae_scores.mean(), mae_scores.std()
(array([4.01772856e-15, 4.98495951e-15, 2.23207142e-15, 1.75589618e-14,
1.04163333e-14]),
np.float64(7.842010926608855e-15),
np.float64(5.5732498699917915e-15))
Explanation:¶
- Split your data into 5 equal parts
- Shuffle the data first (fairness)
- Keep results reproducible (random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)
How it works
- Train on 4 parts
- Test on 1 part
- Repeat 5 times
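The train-on-4, test-on-1 idea can be sketched without scikit-learn. The helper k_fold_indices below is hypothetical and only mimics the idea behind KFold(shuffle=True); it does not reproduce scikit-learn's exact splits.

```python
import random

def k_fold_indices(n_rows, n_splits=5, seed=7):
    """Hypothetical helper: yield (train, test) row indices for each fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    # Deal the shuffled rows into n_splits groups, like dealing cards.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        # Training rows are every fold except the one held out for testing.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(10), start=1):
    print(fold_number, "train:", sorted(train_idx), "test:", sorted(test_idx))
```

Every row is used for testing exactly once and for training four times, which is why the five scores together say more than one train/test split.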
cross_val_score
- Your full pipeline (model) is trained 5 separate times
- Each time with a different train/test split
- Each run produces one score
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
neg_mean_absolute_error
- scikit-learn assumes higher score = better
- MAE is “lower is better”
- So sklearn returns it as negative
scoring="neg_mean_absolute_error"
Converting back to positive MAE
mae_scores = -scores
Interpretation:¶
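- MAE - lower is better
- Here every fold has an MAE below 2e-14 seconds, so the result is consistent across folds (and effectively zero)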
import matplotlib.pyplot as plt
import numpy as np
# Create a square figure and axis for the scatter plot
fig, ax = plt.subplots(figsize=(6,6))
# Scatter plot: actual lap times (x-axis) vs predicted lap times (y-axis)
# Each dot represents one lap
ax.scatter(y_test, pred, s=10, alpha=0.5)
# Find the minimum and maximum values across both actual and predicted
# This ensures the diagonal reference line spans the full data range
mn = min(y_test.min(), pred.min())
mx = max(y_test.max(), pred.max())
# Plot a 45-degree reference line (perfect prediction line: y = x)
ax.plot([mn, mx], [mn, mx], linewidth=2)
# Add title and axis labels for clarity
ax.set_title("Predicted vs Actual LapTimeSeconds")
ax.set_xlabel("Actual (sec)")
ax.set_ylabel("Predicted (sec)")
# Display the scatter plot
plt.show()
# Calculate prediction errors for each lap
# Positive value = model overpredicted, negative = underpredicted
errors = pred - y_test.to_numpy()
# Create a new figure for the error distribution
fig, ax = plt.subplots(figsize=(8,4))
# Plot a histogram of prediction errors
# Shows how often different error sizes occur
ax.hist(errors, bins=40)
# Add title and axis labels for the error plot
ax.set_title("Prediction errors (pred - actual)")
ax.set_xlabel("seconds")
ax.set_ylabel("count")
# Display the histogram
plt.show()
Interpretation:¶
What this plot is:
- X-axis: real lap time (what actually happened)
- Y-axis: predicted lap time (what the model guessed)
- Each dot = one lap
If the model is perfect, the points fall on the 45° diagonal line. Here they do, which means the model is predicting lap times almost perfectly. Oh man, I just consulted with ChatGPT and this is "just too good to be true."
This is because LapTime ≈ Sector1 + Sector2 + Sector3, and all three sector times are model inputs, so the regression only has to learn to add them up.
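This is easy to check on the five rows printed by df.head() earlier. A minimal sketch in plain Python, with the values copied from that output:

```python
# (Sector1, Sector2, Sector3, LapTime) in seconds, copied from df.head().
rows = [
    (28.457, 28.843, 27.559, 84.859),
    (27.212, 28.713, 27.587, 83.512),
    (27.375, 28.455, 27.432, 83.262),
    (27.520, 28.427, 27.641, 83.588),
    (27.434, 28.496, 27.646, 83.576),
]

# Compare the sum of the three sectors with the recorded lap time.
gaps = []
for s1, s2, s3, lap in rows:
    sector_sum = s1 + s2 + s3
    gaps.append(abs(sector_sum - lap))
    print(f"{sector_sum:.3f} vs {lap:.3f}")

max_gap = max(gaps)
print(max_gap < 1e-9)  # any difference is floating-point rounding only
```

For these rows the three sectors add up to the lap time to the millisecond, so the answer is effectively sitting in the inputs. That is leakage, and it explains the R² of 1.0.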
What is the model good for?
- It is good for sanity checking timing data and understanding ML mechanics.
What is the model not good for?
- Predicting future laps
- Supporting strategy decisions
- Modeling tyre degradation realistically.
Insights¶
- Sector times are super predictive
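- TyreLife can drift lap time upward, but not always: in this race the simple linear fit actually slopes slightly downward (about -0.07 seconds per lap of tyre age), because tyre management, fuel load, etc mask the degradation and create outliers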
- TrackStatus can create slow lap clusters.
- Linear Regression is the baseline you compare everything else to.
- To create a strong model you need a lot of additional features
I had high hopes for this model, but it's a good starting point in my ML learning journey.