Day 4: Machine Learning¶
Assignment 28/11/2025¶
Fit a machine learning model to your data
Machine Learning¶
1. Humidity – full period (train vs test)¶
This plot shows the full relative humidity time series, with the Ridge Regression model fitted on the training part (green) and evaluated on the test part (red). The comparison with the real data (blue) lets us see how well the model follows the main trends and how it generalizes to unseen days.
What it shows:
- X-axis: Date (UTC), from 2019 to 2020.
- Y-axis: Relative humidity (%).
- Blue curve: Actual measured humidity values.
- Green curve: Model prediction based on the training data.
- Red curve: Model prediction based on the test data (the final part of the series).
Purpose:
- To see if the model (Ridge + polynomials) reasonably follows the shape of the series.
- The green portion indicates how well it "memorizes" the training data.
- The red portion indicates how well it generalizes to unseen data (the chronological train/test split is sketched just below).
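Throughout this page the split is chronological rather than random, so the test set is always the final 20% of the series. A minimal sketch of what shuffle=False does, using a toy index rather than the station data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # toy time index
y = np.arange(10)

# shuffle=False keeps temporal order: train on the past, test on the future
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.ravel())  # [0 1 2 3 4 5 6 7]
print(X_test.ravel())   # [8 9]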
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# ---------------------------
# 1. LOAD DATA
# ---------------------------
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
df = df.dropna(subset=["UTC", "Hum"])
df = df.sort_values("UTC")
# ---------------------------
# 2. FEATURES
# ---------------------------
X = np.arange(len(df)).reshape(-1, 1)
y = df["Hum"].values
# Normalize X to avoid ill-conditioning
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split (no shuffle for time series)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, shuffle=False
)
# ---------------------------
# 3. MACHINE LEARNING MODEL
# ---------------------------
degree = 4
lambda_reg = 100
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = Ridge(alpha=lambda_reg)
model.fit(X_train_poly, y_train)
# ---------------------------
# 4. PREDICTIONS
# ---------------------------
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)
# ---------------------------
# 5. PLOT
# ---------------------------
plt.figure(figsize=(16,7))
# Real values
plt.plot(df["UTC"], y, label="Real Humidity", color="blue", linewidth=2)
# Train prediction
plt.plot(df["UTC"].iloc[:len(y_pred_train)], y_pred_train,
label="Training Prediction", color="green", linewidth=2)
# Test prediction
plt.plot(df["UTC"].iloc[len(y_pred_train):], y_pred_test,
label="Test Prediction", color="red", linewidth=2)
plt.title("Machine Learning Model Fit - Humidity Prediction\n(Ridge Regression with Polynomial Features)")
plt.xlabel("Date")
plt.ylabel("Humidity (%)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# ---------------------------
# 6. ERRORS
# ---------------------------
print("Train MSE:", mean_squared_error(y_train, y_pred_train))
print("Test MSE:", mean_squared_error(y_test, y_pred_test))
Train MSE: 219.22732953953022
Test MSE: 225.94662985432794
2. Humidity in 2020 (raw model vs clipped 0–100%)¶
Here we zoom into the year 2020 and compare real humidity (blue) with the model predictions for training (green) and test (red). Some predicted values fall outside the physically valid range (below 0% or above 100%), which motivates adding post-processing.
What they show:
- Same scheme (date on X, humidity on Y) but filtering only for the year 2020.
- Blue: actual humidity in 2020.
- Green: model prediction during the training phase in 2020.
- Red: prediction during the test phase in 2020.
- In the "corrected/clipped" version, predictions are forced to be between 0 and 100%, truncating impossible values.
What they are for:
- Isolating a specific year and seeing whether the model remains reasonable over a shorter range.
- Demonstrating an important physical post-processing step: although the mathematical model can produce values like 110% or −5%, we impose realistic limits (0–100%), as sketched below.
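A minimal sketch of this post-processing step (the input array here is made up for illustration):

import numpy as np

# Hypothetical raw model outputs, including physically impossible values
raw_pred = np.array([-3.2, 45.0, 99.1, 104.7, 62.5])

# np.clip truncates everything to the valid 0-100 % range
print(np.clip(raw_pred, 0, 100))  # [  0.   45.   99.1 100.   62.5]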
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# ---------------------------
# 1. LOAD DATA
# ---------------------------
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
df = df.dropna(subset=["UTC", "Hum"])
df["YEAR"] = df["UTC"].dt.year
# ---------------------------
# 2. FILTER ONLY YEAR 2020
# ---------------------------
df_2020 = df[df["YEAR"] == 2020].copy()
df_2020 = df_2020.sort_values("UTC")
# If the dataset does not fully include 2020, warn:
if len(df_2020) < 30:
print("Warning: 2020 has few samples. The model may not generalize well.")
# ---------------------------
# 3. FEATURES
# ---------------------------
X = np.arange(len(df_2020)).reshape(-1, 1)
y = df_2020["Hum"].values
# Normalize X for stability
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split without shuffling
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, shuffle=False
)
# ---------------------------
# 4. MACHINE LEARNING MODEL
# ---------------------------
degree = 4 # OK for daily humidity
lambda_reg = 100 # regularization to avoid overfitting
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = Ridge(alpha=lambda_reg)
model.fit(X_train_poly, y_train)
# ---------------------------
# 5. PREDICTIONS
# ---------------------------
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)
# ---------------------------
# 6. PLOT ONLY YEAR 2020
# ---------------------------
plt.figure(figsize=(16,7))
plt.plot(df_2020["UTC"], y, label="Real Humidity (2020)", color="blue", linewidth=2)
# Prediction ranges
plt.plot(df_2020["UTC"].iloc[:len(y_pred_train)], y_pred_train,
label="Training Prediction", color="green", linewidth=2)
plt.plot(df_2020["UTC"].iloc[len(y_pred_train):], y_pred_test,
label="Test Prediction", color="red", linewidth=2)
plt.title("Humidity Prediction for 2020\nRidge Regression + Polynomial Features")
plt.xlabel("Date")
plt.ylabel("Humidity (%)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
print("Train MSE:", mean_squared_error(y_train, y_pred_train))
print("Test MSE:", mean_squared_error(y_test, y_pred_test))
Train MSE: 224.37052060725497
Test MSE: 435.1552043398633
# ---------------------------
# 5. PREDICTIONS
# ---------------------------
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)
# Enforce physical humidity limits (0–100%)
y_pred_train = np.clip(y_pred_train, 0, 100)
y_pred_test = np.clip(y_pred_test, 0, 100)
# ---------------------------
# 6. PLOT
# ---------------------------
plt.figure(figsize=(16,7))
plt.plot(df_2020["UTC"], y, label="Real Humidity (2020)", color="blue", linewidth=2)
plt.plot(df_2020["UTC"].iloc[:len(y_pred_train)], y_pred_train,
label="Training Prediction", color="green", linewidth=2)
plt.plot(df_2020["UTC"].iloc[len(y_pred_train):], y_pred_test,
label="Test Prediction", color="red", linewidth=2)
plt.title("Humidity Prediction for 2020 (Corrected: Clipped to 0–100%)")
plt.xlabel("Date")
plt.ylabel("Humidity (%)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
3. Humidity by year: 2020, 2021, 2022¶
These figures show real daily humidity (blue) and the model’s predictions on training (green) and test (red) data for the years 2020 to 2022. The train and test MSE values provide a quantitative measure of fit quality and help compare how well the same model performs across different years.
What they show:
- Three independent figures, one per year.
- In each:
- Blue: actual humidity for the corresponding year.
- Green: training prediction.
- Red: test prediction.
- Below each figure, the Train MSE and Test MSE for that year are printed.
What they are for:
- Compare how well the same model (same polynomial degree and regularization) performs in different years.
- See whether any year is easier or harder to fit (e.g., higher test MSE).
- Understand the year-to-year variability in fit quality (an RMSE conversion is sketched after this list).
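Because MSE is in squared units, converting it to RMSE (its square root) makes the numbers directly comparable to the humidity scale. A minimal sketch, using the per-year test MSE values printed further below:

import numpy as np

# Test MSE per year, as printed in the output further below
test_mse = {2020: 144.66, 2021: 171.23, 2022: 198.92}

for year, mse in test_mse.items():
    # RMSE is in the same units as the target (percentage points of humidity)
    print(f"{year}: RMSE ≈ {np.sqrt(mse):.1f} %")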
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# ---------------------------
# 1. LOAD DATA
# ---------------------------
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
df = df.dropna(subset=["UTC", "Hum"])
df["YEAR"] = df["UTC"].dt.year
df = df.sort_values("UTC")
# ---------------------------
# 2. YEARS TO GENERATE
# ---------------------------
years = [2020, 2021, 2022]
# ---------------------------
# 3. MODEL PARAMETERS
# ---------------------------
degree = 4
lambda_reg = 100
for year in years:
df_y = df[df["YEAR"] == year].copy()
if len(df_y) < 30:
print(f"Skipping {year}: Not enough data")
continue
df_y = df_y.sort_values("UTC")
# FEATURES
X = np.arange(len(df_y)).reshape(-1, 1)
y = df_y["Hum"].values
# NORMALIZE X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# TRAIN/TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, shuffle=False
)
# ML MODEL
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = Ridge(alpha=lambda_reg)
model.fit(X_train_poly, y_train)
# PREDICTIONS
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)
# LIMIT TO PHYSICAL VALUES 0–100%
y_pred_train = np.clip(y_pred_train, 0, 100)
y_pred_test = np.clip(y_pred_test, 0, 100)
# PLOT
plt.figure(figsize=(16,7))
plt.plot(df_y["UTC"], y, label=f"Real Humidity {year}", color="blue", linewidth=2)
plt.plot(df_y["UTC"].iloc[:len(y_pred_train)], y_pred_train,
label="Training Prediction", color="green", linewidth=2)
plt.plot(df_y["UTC"].iloc[len(y_pred_train):], y_pred_test,
label="Test Prediction", color="red", linewidth=2)
plt.title(f"Humidity Prediction ({year})\nRidge Regression + Polynomial Degree {degree}")
plt.xlabel("Date")
plt.ylabel("Humidity (%)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# ERROR
print(f"YEAR {year}")
print(" Train MSE:", mean_squared_error(y_train, y_pred_train))
print(" Test MSE:", mean_squared_error(y_test, y_pred_test))
print("--------------------------------------------------")
YEAR 2020
 Train MSE: 224.37052060725497
 Test MSE: 144.6588854323536
--------------------------------------------------
YEAR 2021
 Train MSE: 216.25002655943592
 Test MSE: 171.22929187790547
--------------------------------------------------
YEAR 2022
 Train MSE: 216.38598313357883
 Test MSE: 198.91582079450035
--------------------------------------------------
4. Precipitation by year: 2020, 2021, 2022¶
Daily precipitation (blue) is compared with the model’s training (green) and test (red) predictions for the years 2020 to 2022. The plot highlights that precipitation is harder to model: many zero-rain days and sharp peaks make it difficult for a smooth polynomial model to capture extreme events, even if it can approximate the general level.
What they show:
- Same structure as the humidity charts, but using the Prec column (actual precipitation, in mm).
- Blue: observed precipitation day by day.
- Green: model prediction during training.
- Red: prediction during testing.
- Training and test MSE values at the end of each block.
What they are for:
- To show that the problem is harder for precipitation:
- Many days with 0 mm.
- Very irregular peaks when it rains heavily (the sketch below quantifies this).
- Even so, the model captures a kind of "typical level" and some smooth patterns, but it does not reproduce the extreme peaks: it is a smooth approximation rather than a realistic rainfall model.
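A quick way to quantify this difficulty, assuming the same file and a numeric Prec column as in the code below:

import pandas as pd

df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
prec = df["Prec"].dropna()

# Smooth models struggle when most targets are exactly zero
print("Fraction of dry days (0 mm):", round((prec == 0).mean(), 2))
print("Maximum daily precipitation:", prec.max(), "mm")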
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# ---------------------------
# 1. LOAD DATA
# ---------------------------
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
# use the actual precipitation column: Prec
df = df.dropna(subset=["UTC", "Prec"])
df["YEAR"] = df["UTC"].dt.year
df = df.sort_values("UTC")
# ---------------------------
# 2. YEARS TO MODEL
# ---------------------------
years = [2020, 2021, 2022]
# ---------------------------
# 3. MODEL PARAMETERS
# ---------------------------
degree = 4
lambda_reg = 100
for year in years:
df_y = df[df["YEAR"] == year].copy()
if len(df_y) < 30:
print(f"Skipping {year}: Not enough data")
continue
df_y = df_y.sort_values("UTC")
# FEATURES
X = np.arange(len(df_y)).reshape(-1, 1)
y = df_y["Prec"].values # ← USAR LA COLUMNA CORRECTA
# Normalize X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# TRAIN/TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, shuffle=False
)
# MACHINE LEARNING MODEL
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = Ridge(alpha=lambda_reg)
model.fit(X_train_poly, y_train)
# PREDICTION
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)
# Physical constraint: precipitation cannot be negative
y_pred_train = np.clip(y_pred_train, 0, None)
y_pred_test = np.clip(y_pred_test, 0, None)
# ---------------------------
# 4. PLOT RESULTS
# ---------------------------
plt.figure(figsize=(16,7))
plt.plot(df_y["UTC"], y,
label=f"Real Precipitation {year}", color="blue", linewidth=2)
plt.plot(df_y["UTC"].iloc[:len(y_pred_train)], y_pred_train,
label="Training Prediction", color="green", linewidth=2)
plt.plot(df_y["UTC"].iloc[len(y_pred_train):], y_pred_test,
label="Test Prediction", color="red", linewidth=2)
plt.title(f"Daily Precipitation Prediction ({year})\nRidge Regression + Polynomial Degree {degree}")
plt.xlabel("Date")
plt.ylabel("Precipitation (mm)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
print(f"YEAR {year}")
print(" Train MSE:", mean_squared_error(y_train, y_pred_train))
print(" Test MSE:", mean_squared_error(y_test, y_pred_test))
print("--------------------------------------------------")
YEAR 2020
 Train MSE: 0.47155712166088637
 Test MSE: 1.0483561534664412
--------------------------------------------------
YEAR 2021
 Train MSE: 0.5945258669689761
 Test MSE: 0.7334992092700211
--------------------------------------------------
YEAR 2022
 Train MSE: 0.2521761550324008
 Test MSE: 1.401426994143104
--------------------------------------------------
5. Global precipitation – model vs real + cross-validation¶
This plot shows the full precipitation series (blue) together with the model’s predictions on training (green) and test (red) intervals. The reported cross-validation MSE, along with train and test MSE, summarizes how stable and accurate the model is over different parts of the dataset.
What it shows:
- The entire actual precipitation series in blue.
- Green curve: model prediction during the training period.
- Red curve: model prediction during the test period.
- Also printed:
- Mean cross-validation MSE (5-fold).
- Train and test MSE at the chronological split.
What it's for:
- To see, at a glance, how the model performs across the entire time interval.
- Cross-validation tells you whether the model is reasonably stable when evaluated on different splits of the dataset (a time-aware alternative is sketched after this list).
- Visually, the model traces a much smoother line than the rainfall peaks: it is useful for capturing trends, but not for extreme days.
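One caveat: the code below uses a shuffled KFold, which mixes past and future samples when forming folds. For time series, a splitter that always trains on the past and validates on the future is often preferred; a minimal self-contained sketch with synthetic data (not the station series):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge

# Synthetic stand-in for the polynomial features and target built below
rng = np.random.default_rng(0)
X_poly = rng.normal(size=(400, 5))
y = rng.normal(size=400)

# Each fold trains strictly on earlier samples and validates on later ones
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(alpha=100), X_poly, y, cv=tscv,
                         scoring="neg_mean_squared_error")
print("Time-aware CV MSE (mean):", -scores.mean())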
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# 1. Load and clean data
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
df = df.dropna(subset=["UTC", "Prec"])
df = df.sort_values("UTC")
# 2. Prepare features (X) and target (y)
X = np.arange(len(df)).reshape(-1, 1) # simple time index
y = df["Prec"].values
# 3. Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 4. Split data (train/test)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, shuffle=False
)
# 5. Choose model + hyperparameters
degree = 4
alpha = 100 # regularization strength
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = Ridge(alpha=alpha)
# 6. Train model
model.fit(X_train_poly, y_train)
# 7. Validate model with cross-validation (optional but good practice)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, poly.transform(X_scaled), y, cv=kf,
scoring="neg_mean_squared_error")
print("Cross-validation MSE (mean):", -np.mean(cv_scores))
# 8. Predict
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
# 9. Optional: clip predictions, since precipitation cannot be negative
y_train_pred = np.clip(y_train_pred, 0, None)
y_test_pred = np.clip(y_test_pred, 0, None)
# 10. Plot real vs predicted
plt.figure(figsize=(16,6))
plt.plot(df["UTC"], y, label="Real precipitation", color="blue")
plt.plot(df["UTC"].iloc[:len(y_train_pred)], y_train_pred,
label="Train prediction", color="green")
plt.plot(df["UTC"].iloc[len(y_train_pred):], y_test_pred,
label="Test prediction", color="red")
plt.title("Precipitation – ML Model Prediction vs Real")
plt.xlabel("Date")
plt.ylabel("Precipitation (mm)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# 11. Print errors
print("Train MSE:", mean_squared_error(y_train, y_train_pred))
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
Cross-validation MSE (mean): 0.7032694271111948
Train MSE: 0.5774803281776941
Test MSE: 1.2975664003664988
6. Forecast-style precipitation prediction for 2026¶
Using the model trained on historical data, we generate a synthetic daily precipitation curve for the year 2026. Negative predictions are clipped to zero, and the result should be interpreted as a conceptual extrapolation that illustrates how the model behaves beyond the observed time range, rather than as a realistic weather forecast.
What it shows:
- X-axis: days in the year 2026 (from January 1 to December 31).
- Y-axis: daily precipitation predicted by the model (mm).
- Purple curve: rainfall values that the model forecasts for each day of 2026.
- Negative values are truncated to 0 (it cannot rain "−3 mm").
Purpose:
- It's a forecasting experiment: using the model trained on historical data to extrapolate a full year into the future.
- More than a realistic weather prediction, it's a way to:
- See how a regularized polynomial extrapolates beyond the data.
- Illustrate the risks of extrapolation: it tends to generate an overly smooth, rather artificial curve (see the toy sketch below).
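To make the extrapolation risk concrete before running the forecast, here is a tiny self-contained sketch (synthetic data, not the station series): a degree-4 polynomial fitted on [0, 1] behaves sensibly inside that range but diverges rapidly outside it.

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Fit a degree-4 polynomial, mirroring the model used in this section
coeffs = np.polyfit(x, y, deg=4)

print("p(0.5) =", np.polyval(coeffs, 0.5))  # interpolation: close to the data
print("p(3.0) =", np.polyval(coeffs, 3.0))  # extrapolation: typically huge and meaningless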
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
# ---------------------------
# 1. LOAD DATA
# ---------------------------
df = pd.read_csv("datasets/1363X-20190215-20200416.csv", sep=";")
df["UTC"] = pd.to_datetime(df["UTC"], errors="coerce")
df = df.dropna(subset=["UTC", "Prec"])
df = df.sort_values("UTC")
# ---------------------------
# 2. PREPARE FEATURES
# ---------------------------
X = np.arange(len(df)).reshape(-1, 1) # time index
y = df["Prec"].values
# Normalize X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Polynomial + Regularization
degree = 4
lambda_reg = 100
poly = PolynomialFeatures(degree=degree)
X_poly = poly.fit_transform(X_scaled)
model = Ridge(alpha=lambda_reg)
model.fit(X_poly, y)
# ---------------------------
# 3. CREATE FUTURE DATES — YEAR 2026
# ---------------------------
future_days = 365 # 1 year
last_index = X[-1][0]
X_future = np.arange(last_index + 1, last_index + 1 + future_days).reshape(-1, 1)
# Scale future X
X_future_scaled = scaler.transform(X_future)
# Polynomial transform
X_future_poly = poly.transform(X_future_scaled)
# Predict precipitation
y_future = model.predict(X_future_poly)
# Physical constraint: precipitation cannot be negative
y_future = np.clip(y_future, 0, None)
# Create future date index (2026 only)
start_2026 = pd.Timestamp("2026-01-01")
dates_2026 = pd.date_range(start_2026, periods=future_days, freq="D")
# ---------------------------
# 4. PLOT FORECAST FOR 2026
# ---------------------------
plt.figure(figsize=(16,7))
plt.plot(dates_2026, y_future, color="purple", linewidth=2,
label="Predicted precipitation (2026)")
plt.title("Predicted Daily Precipitation – Year 2026\n(Ridge Regression Forecast)")
plt.xlabel("Date (2026)")
plt.ylabel("Predicted precipitation (mm)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# ---------------------------
# 5. SHOW SAMPLE OUTPUT
# ---------------------------
print("First 10 predicted values for 2026:")
print(y_future[:10])
First 10 predicted values for 2026:
[0.43739707 0.43743813 0.4374792 0.43752026 0.43756134 0.43760241 0.43764349 0.43768457 0.43772566 0.43776674]
As a second experiment, we repeat the 2026 forecast with a longer historical series (October 2008 to November 2025). This file uses Spanish column names (FECHA for the date, PRECIPITACION for daily precipitation) and day/month/two-digit-year dates, so the loading step differs slightly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# =====================================
# 1. LOAD DATA
# =====================================
df = pd.read_csv("datasets/1363X-20081001-20251107.csv", sep=";")
# Correct date parsing
df["FECHA"] = pd.to_datetime(df["FECHA"], format="%d/%m/%y", errors="coerce")
# Drop rows with missing values
df = df.dropna(subset=["FECHA", "PRECIPITACION"])
df = df.sort_values("FECHA")
# =====================================
# 2. PREPARE FEATURES AND TARGET
# =====================================
# X = time index
X = np.arange(len(df)).reshape(-1, 1)
y = df["PRECIPITACION"].values # the correct column
# Normalize X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Polynomial features
degree = 4
poly = PolynomialFeatures(degree=degree)
X_poly = poly.fit_transform(X_scaled)
# Model
model = Ridge(alpha=100)
model.fit(X_poly, y)
# =====================================
# 3. FORECAST FOR 2026
# =====================================
# Generate 365 days into the future
future_days = 365
last_index = X[-1][0]
X_future = np.arange(last_index + 1, last_index + future_days + 1).reshape(-1, 1)
X_future_scaled = scaler.transform(X_future)
X_future_poly = poly.transform(X_future_scaled)
y_future = model.predict(X_future_poly)
# Precipitation cannot be negative
y_future = np.clip(y_future, 0, None)
# Generate 2026 date range
dates_2026 = pd.date_range(start="2026-01-01", periods=future_days, freq="D")
# =====================================
# 4. PLOT RESULT
# =====================================
plt.figure(figsize=(16, 7))
plt.plot(dates_2026, y_future, color="purple", linewidth=2)
plt.title("Predicted Daily Precipitation for 2026\n(Polynomial ML Model + Ridge)")
plt.xlabel("Date")
plt.ylabel("Predicted Precipitation (mm)")
plt.grid(True)
plt.tight_layout()
plt.show()
# =====================================
# 5. SHOW SAMPLE OUTPUT
# =====================================
print("First 10 predicted values for 2026 (mm):")
print(y_future[:10])
First 10 predicted values for 2026 (mm):
[5.97625779 5.97731266 5.97836786 5.97942341 5.9804793 5.98153552 5.98259209 5.98364899 5.98470624 5.98576383]
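One caveat worth flagging about the format="%d/%m/%y" parse above: two-digit years follow the POSIX strptime pivot (00–68 map to 2000–2068, 69–99 to 1969–1999), so a quick sanity check on the parsed range is cheap insurance. A minimal sketch:

import pandas as pd

# Two-digit years follow the POSIX pivot: 00-68 -> 20xx, 69-99 -> 19xx
dates = pd.to_datetime(["01/10/08", "07/11/25", "31/12/70"], format="%d/%m/%y")
print(dates.year.tolist())  # [2008, 2025, 1970]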
Weekly Conclusion¶
This week we took the first serious step into predictive modelling with machine learning applied to our weather data. Using the humidity and precipitation series, we built a full pipeline: loading and cleaning the CSV, creating a time index as the input variable, normalizing inputs, generating polynomial features, and training a Ridge Regression model that captures smooth trends over time without severe overfitting. We also respected the temporal nature of the data (no random shuffling of past and future) and evaluated performance with a proper train/test split and error metrics (MSE), including cross-validation in the precipitation case.
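As a design note, the per-step code above (scaler, then polynomial features, then Ridge) can be written more compactly with sklearn's Pipeline, which also prevents accidentally fitting the scaler on test data; a minimal sketch with a synthetic series:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

# Same three steps as in the sections above, chained into one estimator
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=4),
    Ridge(alpha=100),
)

# Synthetic stand-in for the time index and humidity series
X = np.arange(400).reshape(-1, 1)
y = 60 + 10 * np.sin(np.linspace(0, 8 * np.pi, 400))

model.fit(X, y)
print("In-sample MSE:", np.mean((model.predict(X) - y) ** 2))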
We added simple physical constraints to the predictions (humidity between 0–100 %, precipitation ≥ 0) and explored how the model behaves year by year (2020, 2021, 2022), checking how far it generalizes and where the limitations of such a simple, time-only model start to appear. Finally, we used the model to generate a “rain prediction” for 2026, more as a conceptual experiment than a realistic forecast: it helps us see how the model extrapolates, what the predicted curves look like, and why we need to be very cautious when going beyond the observed data range.
Overall, this week was about moving from describing the past to trying to predict the future with a simple but complete model, and about better understanding the role of regularization, feature design and validation in machine learning. This sets the foundation for using richer models in the future (more input variables, different architectures), with a clear awareness of their risks and limitations.