Jeoffrey George - Fab Futures - Data Science
Home About

< Week 2 - Fitting a function ...Week 3 - Probability distribution >

Week 2 - Fitting a machine learning model¶

In [1]:
country_name = "United States of America" #select the country to show

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_parquet("/home/jovyan/work/jeogeorge/datasets/soybean_yield_country.parquet") 
dfmedian=(df.groupby(["Country", "year"], as_index=False)["Yield"].median().rename(columns={"Yield":"median_yield"}))
sel = dfmedian[(dfmedian["Country"] == country_name) & dfmedian["median_yield"]]
x = sel["year"]
y = sel["median_yield"]
plt.scatter(
    x,
    y, 
    label= f"Median by year",
)
plt.xlabel("Year")
plt.ylabel(r"Yield (t.ha$^{-1}$)")
plt.title(f"Distribution of soybean yield in {country_name} by year")
plt.show()
dfmedian.head()
No description has been provided for this image
Out[1]:
Country year median_yield
0 Argentina 1982 2.075966
1 Argentina 1983 1.851908
2 Argentina 1984 2.436925
3 Argentina 1985 1.999814
4 Argentina 1986 2.105784

ChatGPT 5.1 prompt "Fit a machine learning model to fit Tikhonov regularization on a dataset consisting of the median yield of soybeans per year per country. How to approach it with a training and testing dataset?" "Choose the train/test split by country rather than by time" "How well can a regularized model generalize to new countries it has never seen before?"

In [2]:
#1.Training pipeline (Python / scikit-learn)
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
#2. Country-based train/test split 
## Step 1: Select held-out countries
countries = dfmedian['Country'].unique()
np.random.seed(42)
test_countries = np.random.choice(
    countries,
    size=int(0.2 * len(countries)),
    replace=False
)
train_countries = [c for c in countries if c not in test_countries]
## Step 2: Split the data
train_df = dfmedian[dfmedian['Country'].isin(train_countries)]
test_df  = dfmedian[dfmedian['Country'].isin(test_countries)]
X_train, y_train = train_df[['year', 'Country']], train_df['median_yield']
X_test, y_test   = test_df[['year', 'Country']], test_df['median_yield']
#train_df.head()
#test_df.head()
#3. Model pipeline (ridge = Tikhonov)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['year']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country'])
    ]
)
ridge_model = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('model', Ridge(alpha=1.0))
    ]
)
#4. Fit the model
ridge_model.fit(X_train, y_train)
#5. Evaluate on unseen countries
from sklearn.metrics import mean_squared_error
y_pred = ridge_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE on unseen countries: {rmse:.3f}")
RMSE on unseen countries: 1.261

ChatGPT 5.1 prompt "I want to use a machine-learning model to analyze soybean yield trends over time. Specifically, I want to:

fit an ML model to capture long-term technological yield trends,

separate trend from weather-driven variability, and

detect anomalous years like drought or flood years.

Please adapt my existing Python code to use a machine-learning model (e.g., Random Forest or Gradient Boosting) to estimate the trend and identify anomalies.*"

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# ---------------------------------------------------
# 1. Load data and compute yearly median yield
# ---------------------------------------------------
country_name = "United States of America"

df = pd.read_parquet("/home/jovyan/work/jeogeorge/datasets/soybean_yield_country.parquet")

dfmedian = (
    df.groupby(["Country", "year"], as_index=False)["Yield"]
    .median()
    .rename(columns={"Yield": "median_yield"})
)

sel = dfmedian[dfmedian["Country"] == country_name].copy()

X = sel["year"].to_numpy().reshape(-1, 1)
y = sel["median_yield"].to_numpy()

# ---------------------------------------------------
# 2. Machine-Learning Model: Random Forest
# ---------------------------------------------------
rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,        # allow flexibility to model nonlinear trends
    min_samples_split=2,
    random_state=42
)

rf.fit(X, y)

# Trend curve (ML-based estimate of structural trend)
sel["trend_pred"] = rf.predict(X)

# ---------------------------------------------------
# 3. Residuals = deviations from ML-estimated trend
# ---------------------------------------------------
sel["residual"] = sel["median_yield"] - sel["trend_pred"]

# Anomaly threshold: 2 standard deviations of residuals
threshold = 2 * sel["residual"].std()
sel["anomaly"] = sel["residual"].abs() > threshold

# ---------------------------------------------------
# 4. Plot ML-trend + anomalies
# ---------------------------------------------------
plt.figure(figsize=(12, 4))
plt.scatter(sel["year"], sel["median_yield"], label="Median yield", alpha=0.7)
plt.plot(sel["year"], sel["trend_pred"], color="orange", linewidth=2, label="ML Trend (Random Forest)")

# Highlight anomalies
anomalies = sel[sel["anomaly"]]
plt.scatter(
    anomalies["year"], anomalies["median_yield"],
    color="red", s=80, label="Anomalous years"
)

plt.xlabel("Year")
plt.ylabel(r"Yield (t ha$^{-1}$)")
plt.title(f"ML-detected trend and anomalies in {country_name}")
plt.legend()
plt.show()

# ---------------------------------------------------
# 5. Report anomalies
# ---------------------------------------------------
print("Detected anomalous years (machine-learning residual analysis):")
print(anomalies[["year", "median_yield", "residual"]])
No description has been provided for this image
Detected anomalous years (machine-learning residual analysis):
      year  median_yield  residual
1067  1994      2.954587  0.161505
In [ ]: