Madhu Limbu - Fab Futures - Data Science

Assignment¶

Fit a machine learning model to your data

Fit a machine learning model to your data¶

Python code that fits a machine‑learning model to the data for China’s CO₂ emissions.¶

Using a bar graph¶

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# 1. Load data
df = pd.read_csv("datasets/climate.csv")
df['date'] = pd.to_datetime(df['date'])

# 2. Filter for China
df_china = df[df['country'] == "China"].sort_values('date').copy()

# 3. Prepare features and target
df_china['date_ordinal'] = df_china['date'].map(pd.Timestamp.toordinal)
feature_cols = ['date_ordinal', 'energy_consumption', 'avg_temperature', 'humidity']
X = df_china[feature_cols].fillna(0).values
y = df_china['co2_emission'].values

# 4. Split into train/test
X_train, X_test, y_train, y_test, dates_train, dates_test = train_test_split(
    X, y, df_china['date'], test_size=0.2, random_state=42
)

# 5. Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# 6. Evaluate
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# 7. Plot: Actual vs Predicted as bar graph
plt.figure(figsize=(16,6))

# Plot actual CO2 emissions
plt.bar(dates_test, y_test, width=1, alpha=0.6, color='skyblue', label='Actual CO2')

# Overlay predicted CO2 emissions
plt.bar(dates_test, y_pred, width=1, alpha=0.7, color='salmon', label='Predicted CO2')

plt.xlabel("Date")
plt.ylabel("CO2 Emission")
plt.title("China CO2 Emission: Actual vs Predicted")
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
MSE: 50632.96445883851
R²: 0.007811812078080904
Coefficients: [-0.00597355  0.0093239  -0.94021784  0.1682823 ]
Intercept: 4800.520689860886
[Figure: bar chart of actual vs predicted CO2 emission for China on the test set]
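An R² of about 0.008 means the fitted line explains almost none of the variation in the test data. The MSE of about 50633 corresponds to an RMSE of roughly 225 in the units of co2_emission. One way to put these numbers in context is to compare the model against a baseline that always predicts the training mean, which by definition scores an R² of zero or slightly below on held-out data. The sketch below is only an illustration under assumptions: the data is synthetic noise rather than the climate dataset, and the names `X_demo`, `y_demo`, `baseline`, and `linear` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the China subset:
# the target is noise around a constant, so the four features carry no signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = 500 + rng.normal(scale=225, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

# Same model type as in the notebook, fitted on the same split.
linear = LinearRegression().fit(X_tr, y_tr)

r2_baseline = r2_score(y_te, baseline.predict(X_te))
r2_linear = r2_score(y_te, linear.predict(X_te))

print("Baseline R²:", r2_baseline)
print("Linear regression R²:", r2_linear)
```

If the linear model's R² on the real data is no better than the baseline's, the chosen features are probably not linearly related to the target, and it may be worth trying other features or another model.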

Years 2020–2021, for better reading¶

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# 1. Load dataset
df = pd.read_csv("datasets/climate.csv")
df['date'] = pd.to_datetime(df['date'])

# 2. Filter for China and years 2020-2021
df_china = df[(df['country'] == "China") & (df['date'].dt.year.isin([2020, 2021]))].sort_values('date').copy()

# 3. Prepare features and target
df_china['date_ordinal'] = df_china['date'].map(pd.Timestamp.toordinal)
feature_cols = ['date_ordinal', 'energy_consumption', 'avg_temperature', 'humidity']
X = df_china[feature_cols].fillna(0).values
y = df_china['co2_emission'].values

# 4. Split into train/test
X_train, X_test, y_train, y_test, dates_train, dates_test = train_test_split(
    X, y, df_china['date'], test_size=0.2, random_state=42
)

# 5. Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# 6. Evaluate
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# 7. Plot actual vs predicted as a bar graph
plt.figure(figsize=(16,6))

# Actual CO2 emission bars
plt.bar(dates_test, y_test, width=1, alpha=0.6, color='skyblue', label='Actual CO2')

# Predicted CO2 emission bars overlaid
plt.bar(dates_test, y_pred, width=1, alpha=0.7, color='salmon', label='Predicted CO2')

plt.xlabel("Date")
plt.ylabel("CO2 Emission")
plt.title("China CO2 Emission (2020-2021): Actual vs Predicted")
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
MSE: 58229.880419159854
R²: -0.00032924717434634765
Coefficients: [-0.09801861  0.01018575 -2.13020096 -0.35608997]
Intercept: 72745.65997169945
[Figure: bar chart of actual vs predicted CO2 emission for China, 2020–2021 test set]
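Because the rows are daily observations, a random split mixes earlier and later dates in both the training and test sets. An alternative worth trying is a chronological split, where the model is trained on the earlier dates and tested on the most recent ones. The sketch below assumes synthetic data with the same column names as the notebook; the values, the built-in relationship between energy_consumption and co2_emission, and the names `demo` and `model_demo` are hypothetical and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical daily data for 2020-2021; column names mirror the notebook,
# but every value here is synthetic.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D")
demo = pd.DataFrame({
    "date": dates,
    "energy_consumption": rng.normal(5000, 500, len(dates)),
    "avg_temperature": rng.normal(15, 8, len(dates)),
    "humidity": rng.uniform(30, 90, len(dates)),
})
# Assumed relationship, chosen only so the example has something to learn.
demo["co2_emission"] = 0.1 * demo["energy_consumption"] + rng.normal(0, 20, len(dates))

feature_cols = ["energy_consumption", "avg_temperature", "humidity"]
X_demo = demo[feature_cols].values
y_demo = demo["co2_emission"].values

# shuffle=False keeps the row order, so the last 20% of dates become the test set.
X_tr, X_te, y_tr, y_te, d_tr, d_te = train_test_split(
    X_demo, y_demo, demo["date"], test_size=0.2, shuffle=False
)

model_demo = LinearRegression().fit(X_tr, y_tr)
r2_holdout = r2_score(y_te, model_demo.predict(X_te))

print("Train period:", d_tr.min().date(), "to", d_tr.max().date())
print("Test period: ", d_te.min().date(), "to", d_te.max().date())
print("R² on held-out period:", r2_holdout)
```

The only change from the notebook's split is `shuffle=False` in place of `random_state=42`; the data must already be sorted by date, as it is after `sort_values('date')`.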

References & Resources¶

The “Simple Linear Regression” approach from scikit‑learn.

ChatGPT Prompt for Machine Learning Model¶

Prompts:

  • Please write Python code to build a machine learning model that predicts daily CO₂ emissions for China. Follow these steps:

  • Filter the dataset to include only China and, optionally, the years 2020–2021 for readability.

  • Convert the date column to datetime and create a numeric feature from it (e.g., ordinal) for modelling.

  • Select features: date_ordinal, energy_consumption, avg_temperature, and humidity.

  • Split the data into training and test sets (e.g., 80/20).

  • Train a linear regression model (or any other regressor, such as RandomForestRegressor) on the training data.

  • Evaluate the model using metrics like Mean Squared Error (MSE) and R² score.

  • Visualise the results: plot a bar graph of actual vs predicted CO₂ emissions for the test set, with clear labels, colours, and legend.
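The prompt mentions RandomForestRegressor as an alternative to linear regression. A minimal sketch of that swap is shown here, assuming synthetic data with a deliberately nonlinear target; the names `X_demo`, `y_demo`, and `forest`, and the choice of 200 trees, are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: four features, of which only the second
# (index 1) drives the target, through a nonlinear sine relationship.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(600, 4))
y_demo = 100 * np.sin(2 * np.pi * X_demo[:, 1]) + rng.normal(0, 5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same fit/predict interface as LinearRegression, so it drops into the notebook's steps.
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_tr, y_tr)
y_hat = forest.predict(X_te)

forest_mse = mean_squared_error(y_te, y_hat)
forest_r2 = r2_score(y_te, y_hat)

print("MSE:", forest_mse)
print("R²:", forest_r2)
# Importances sum to 1 and give a rough ranking of how much each feature was used.
print("Feature importances:", forest.feature_importances_)
```

A random forest has no coefficients or intercept to print, so `feature_importances_` takes the place of `model.coef_` when inspecting what the model relied on.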

Challenges¶

I initially had difficulty reading and interpreting my dataset because I visualised it using a bubble chart, which made the information crowded and hard to understand. I reached out to Rico for assistance, and he suggested switching to a bar graph instead. He also shared a tutorial that helped me clearly understand how to structure my data and present it in a more readable way.

Tutorial

YouTube Tutorial¶

Handwritten Digit Recognition [Demo] | Make a complete project | Python and Machine Learning