Week 2: Fitting
November 25, 2025
I'm really not good at math, especially calculus and algebra. High school gave me enough math for simple calculations, but not much more. After I started working at Adif (the Spanish railway infrastructure manager), I considered pursuing a technical degree in Electronic Engineering. It was at a distance-learning university, so all I had were the books and the online forums for questions. I took Algebra and Calculus, and it was a complete disaster, haha.
Last Tuesday, Neil's class combined math and English, which is not easy to follow after a long morning at work. So I used ChatGPT; sorry about that.
Overfitting
November 26, 2025
With the help of ChatGPT, I worked with a new dataset: the volume of travelers by time slot (Barcelona).
Prompt: I have the following CSV file containing the passenger volume by time slot (Barcelona). I need to represent this in Jupyter, using overfitting (fitting noise vs. data), to show passenger volume curves throughout the day in the Barcelona city center. Display time slots.
This is the result, with degree 12 → overfitting:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# 1. Load the CSV
df = pd.read_csv(
    "datasets/barcelona_viajeros_por_franja_csv.csv",
    encoding="latin-1",
    sep=";"
)
# 2. Make sure the time slot is text, then sort
df["TRAMO_HORARIO"] = df["TRAMO_HORARIO"].astype(str)
df_sorted = df.sort_values("TRAMO_HORARIO")
# 3. Sum travelers by time slot (total across the entire hub)
agg = df_sorted.groupby("TRAMO_HORARIO")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum().reset_index()
# 4. Convert "HH:MM - HH:MM" → minutes since midnight (we use the start of the segment)
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()  # "HH:MM"
    h, m = map(int, inicio.split(":"))
    return h * 60 + m
agg["minutos"] = agg["TRAMO_HORARIO"].apply(time_to_min)
# 5. Data for the fit (we use VIAJEROS_SUBIDOS, passengers boarded, as an example)
x = agg["minutos"].values
y = agg["VIAJEROS_SUBIDOS"].values
# 6. High-degree polynomial fitting (overfitting)
grado = 12 # You can play with this value
coef = np.polyfit(x, y, grado)
poly = np.poly1d(coef)
x_fit = np.linspace(x.min(), x.max(), 300)
y_fit = poly(x_fit)
# 7. Representation
plt.figure(figsize=(14, 6))
# Real data
plt.plot(x, y, "o", label="Actual data (passengers boarded)")
# Overfitted curve
plt.plot(x_fit, y_fit, label=f"Overfitted model (degree {grado})")
# X-axis with time slot labels
plt.xticks(
    ticks=agg["minutos"],
    labels=agg["TRAMO_HORARIO"],
    rotation=90
)
plt.xlabel("Time slot")
plt.ylabel("Number of passengers boarded")
plt.title("Volume of travelers per time slot (Barcelona)")
plt.legend()
plt.tight_layout()
plt.show()
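One caveat with step 6: `x` is in minutes (values up to about 1,400), and raising numbers that large to the 12th power can make the least-squares problem numerically delicate. A minimal sketch of an alternative, assuming the same kind of half-hourly data, is `numpy.polynomial.Polynomial.fit`, which rescales `x` to the interval [-1, 1] before fitting. The passenger counts below are invented stand-ins, not values from the CSV.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical stand-in data: one sample every 30 minutes, with a morning
# and an evening peak. These are NOT the real Barcelona figures.
x = np.arange(0, 24 * 60, 30, dtype=float)           # minutes since midnight
y = (
    900.0 * np.exp(-((x - 8 * 60) / 90.0) ** 2)       # peak around 08:00
    + 700.0 * np.exp(-((x - 18 * 60) / 120.0) ** 2)   # peak around 18:00
)

grado = 12
# Polynomial.fit maps [x.min(), x.max()] onto [-1, 1] internally,
# so the fit never works with huge powers of the raw minute values.
poly = Polynomial.fit(x, y, grado)

x_fit = np.linspace(x.min(), x.max(), 300)
y_fit = poly(x_fit)  # evaluated in the original units (minutes)

print("domain used for rescaling:", poly.domain)
print("largest fitted value:", y_fit.max())
```

The returned object can be called directly on minute values, so it can replace `np.poly1d(coef)` in the plotting step without other changes.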
A clear example of overfitting is shown by applying a high-degree polynomial curve (degree 12) to real-world passenger data for each time slot in downtown Barcelona.
The graph shows:
Yellow dots → real data (passengers boarded)
Blue line → highly overfitted model attempting to capture the noise in the dataset
The overfitting is evident in the exaggerated oscillations and extreme values that don't make sense in reality.
I notice that the curve spikes between 00:30 and 05:00, the window when no trains run and infrastructure maintenance is carried out.
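Overfitting can also be checked with numbers instead of only by eye. The sketch below is one possible approach, not something from the class: it hides a few points from the fit and compares the error on the points the model saw with the error on the hidden ones. The data is synthetic (a smooth curve plus random noise) and only stands in for the passenger counts. The error on the training points can only shrink as the degree grows; a held-out error that stops improving or gets worse is the usual sign of fitting noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical stand-in for the passenger curve: smooth shape plus noise.
x = np.linspace(0.0, 1.0, 48)                      # time of day rescaled to [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Hide every fourth point; the model never sees these during fitting.
idx = np.arange(x.size)
held_out = (idx % 4 == 2)
train = ~held_out

def rmse(poly, xs, ys):
    """Root-mean-square error of poly on the points (xs, ys)."""
    return float(np.sqrt(np.mean((poly(xs) - ys) ** 2)))

results = {}
for grado in (2, 5, 12):
    poly = Polynomial.fit(x[train], y[train], grado)
    results[grado] = (
        rmse(poly, x[train], y[train]),        # error on points used for the fit
        rmse(poly, x[held_out], y[held_out]),  # error on hidden points
    )
    print(f"degree {grado:2d}: train RMSE = {results[grado][0]:.3f}, "
          f"held-out RMSE = {results[grado][1]:.3f}")
```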
Degree 2 → underfitting:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# 1. Load the CSV
df = pd.read_csv(
    "datasets/barcelona_viajeros_por_franja_csv.csv",
    encoding="latin-1",
    sep=";"
)
# 2. Make sure the time slot is text, then sort
df["TRAMO_HORARIO"] = df["TRAMO_HORARIO"].astype(str)
df_sorted = df.sort_values("TRAMO_HORARIO")
# 3. Sum travelers by time slot (total across the entire hub)
agg = df_sorted.groupby("TRAMO_HORARIO")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum().reset_index()
# 4. Convert "HH:MM - HH:MM" → minutes since midnight (we use the start of the segment)
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()  # "HH:MM"
    h, m = map(int, inicio.split(":"))
    return h * 60 + m
agg["minutos"] = agg["TRAMO_HORARIO"].apply(time_to_min)
# 5. Data for the fit (we use VIAJEROS_SUBIDOS, passengers boarded, as an example)
x = agg["minutos"].values
y = agg["VIAJEROS_SUBIDOS"].values
# 6. Low-degree polynomial fitting (underfitting)
grado = 2 # You can play with this value
coef = np.polyfit(x, y, grado)
poly = np.poly1d(coef)
x_fit = np.linspace(x.min(), x.max(), 300)
y_fit = poly(x_fit)
# 7. Representation
plt.figure(figsize=(14, 6))
# Real data
plt.plot(x, y, "o", label="Actual data (passengers boarded)")
# Underfitted curve
plt.plot(x_fit, y_fit, label=f"Underfitted model (degree {grado})")
# X-axis with time slot labels
plt.xticks(
    ticks=agg["minutos"],
    labels=agg["TRAMO_HORARIO"],
    rotation=90
)
plt.xlabel("Time slot")
plt.ylabel("Number of passengers boarded")
plt.title("Volume of travelers per time slot (Barcelona)")
plt.legend()
plt.tight_layout()
plt.show()
Degree 5 → good fit:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# 1. Load the CSV
df = pd.read_csv(
    "datasets/barcelona_viajeros_por_franja_csv.csv",
    encoding="latin-1",
    sep=";"
)
# 2. Make sure the time slot is text, then sort
df["TRAMO_HORARIO"] = df["TRAMO_HORARIO"].astype(str)
df_sorted = df.sort_values("TRAMO_HORARIO")
# 3. Sum travelers by time slot (total across the entire hub)
agg = df_sorted.groupby("TRAMO_HORARIO")[["VIAJEROS_SUBIDOS", "VIAJEROS_BAJADOS"]].sum().reset_index()
# 4. Convert "HH:MM - HH:MM" → minutes since midnight (we use the start of the segment)
def time_to_min(tramo):
    inicio = tramo.split("-")[0].strip()  # "HH:MM"
    h, m = map(int, inicio.split(":"))
    return h * 60 + m
agg["minutos"] = agg["TRAMO_HORARIO"].apply(time_to_min)
# 5. Data for the fit (we use VIAJEROS_SUBIDOS, passengers boarded, as an example)
x = agg["minutos"].values
y = agg["VIAJEROS_SUBIDOS"].values
# 6. Moderate-degree polynomial fitting (good fit)
grado = 5 # You can play with this value
coef = np.polyfit(x, y, grado)
poly = np.poly1d(coef)
x_fit = np.linspace(x.min(), x.max(), 300)
y_fit = poly(x_fit)
# 7. Representation
plt.figure(figsize=(14, 6))
# Real data
plt.plot(x, y, "o", label="Actual data (passengers boarded)")
# Fitted curve
plt.plot(x_fit, y_fit, label=f"Polynomial fit (degree {grado})")
# X-axis with time slot labels
plt.xticks(
    ticks=agg["minutos"],
    labels=agg["TRAMO_HORARIO"],
    rotation=90
)
plt.xlabel("Time slot")
plt.ylabel("Number of passengers boarded")
plt.title("Volume of travelers per time slot (Barcelona)")
plt.legend()
plt.tight_layout()
plt.show()
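The three listings differ only in the value of `grado`, so they could be folded into one helper that draws all the curves on the same axes, which also makes the comparison easier to see. This is a sketch under assumptions: `plot_fits` is a hypothetical name, and the demo arrays are invented stand-ins for `agg["minutos"]` and `agg["VIAJEROS_SUBIDOS"]`.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fits(x, y, grados=(2, 5, 12)):
    """Plot the data points plus one np.polyfit curve per degree in `grados`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_fit = np.linspace(x.min(), x.max(), 300)

    fig, ax = plt.subplots(figsize=(14, 6))
    ax.plot(x, y, "o", label="Actual data (passengers boarded)")
    for grado in grados:
        poly = np.poly1d(np.polyfit(x, y, grado))
        ax.plot(x_fit, poly(x_fit), label=f"Polynomial fit (degree {grado})")
    ax.set_xlabel("Minutes since midnight")
    ax.set_ylabel("Number of passengers boarded")
    ax.set_title("Volume of travelers per time slot (Barcelona)")
    ax.legend()
    fig.tight_layout()
    return fig, ax

# Hypothetical stand-in data (NOT the real CSV): half-hourly samples
# with a morning and an evening peak.
x_demo = np.arange(0, 24 * 60, 30)
y_demo = (
    900.0 * np.exp(-((x_demo - 8 * 60) / 90.0) ** 2)
    + 700.0 * np.exp(-((x_demo - 18 * 60) / 120.0) ** 2)
)

# In Jupyter the figure renders when the cell finishes;
# in a plain script, call plt.show() afterwards.
fig, ax = plot_fits(x_demo, y_demo)
```

With the real data, the call would be `plot_fits(agg["minutos"], agg["VIAJEROS_SUBIDOS"])`, followed by the same `plt.xticks(...)` call used above to label the time slots.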