Week 4: Machine Learning (28 November 2025)
Deep learning models are trained with backpropagation: the error at the output is propagated backward through the network to compute the gradient of the loss with respect to each weight, and gradient descent uses those gradients to update the weights. Several optimizers build on this: SGD updates the weights from mini-batches of data, momentum accumulates past gradients to damp oscillations and help escape shallow local minima, Adam adapts the learning rate per parameter, and L-BFGS uses approximate curvature information to converge in fewer iterations. Early stopping, dropout, and L1/L2 regularization reduce overfitting, while pruning and quantization shrink a model and make it cheaper to run. Vanishing or exploding gradients are a common difficulty in deep networks. Once trained, a model performs inference to make predictions.
Neural networks take different forms depending on the task: MLPs (fully connected DNNs) stack hidden layers, CNNs learn spatial features, and RNNs handle sequential dependencies, with LSTMs improving long-term memory. Transformers use attention to capture long-range relationships and power large language models (LLMs). GANs and VAEs generate synthetic data, PINNs incorporate physical laws, and surrogate models emulate complex simulations. AutoML tools automate model design, agentic AI enables autonomous actions, and SVMs provide a classical alternative. Ecosystems including Hugging Face, Kaggle, Edge Impulse, and ONNX support model development and deployment from large-scale systems to edge devices.
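The gradient-descent and momentum updates mentioned in the training overview can be illustrated on a toy problem. The sketch below minimizes a one-dimensional quadratic loss; the function name `gradient_descent`, the loss, and the hyperparameter values are assumptions chosen for illustration, and a real network would obtain the gradient through backpropagation instead of a closed-form expression.

```python
def gradient_descent(grad, w0, lr=0.1, momentum=0.0, steps=200):
    """Minimize a 1-D function given its gradient (hypothetical helper).

    With momentum=0.0 this is plain gradient descent; with momentum>0 the
    velocity term accumulates a decaying sum of past gradients.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w


# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3
def grad(w):
    return 2.0 * (w - 3.0)


w_plain = gradient_descent(grad, w0=0.0)
w_momentum = gradient_descent(grad, w0=0.0, momentum=0.9)
print(w_plain, w_momentum)
```

Both runs end close to the minimizer w = 3. On a well-conditioned problem like this one momentum brings no benefit; it tends to help on elongated or noisy loss surfaces.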
Assignment: We are asked to fit a machine learning model to our datasets
Compiled Dataset: Alcohol-Related Deaths / Burden in Bhutan
Introduction to the Dataset
This dataset presents a compiled summary of alcohol-related deaths and alcohol-attributable health indicators in Bhutan, drawn from publicly available national and international sources. The data combines information from the Ministry of Health’s Annual Health Bulletins, the National Statistics Bureau’s Vital Statistics Reports, WHO country profiles, and published research such as the Bhutan Health Journal. It includes annual figures on alcohol-related liver disease (ALD) deaths, the proportion of deaths attributed to alcohol in health facilities, trends across multiple years, and population-level alcohol-consumption indicators. The dataset is designed to provide a clear picture of how alcohol contributes to mortality and public health challenges in Bhutan, enabling further analysis, comparison, and interpretation for academic or policy-related purposes.
Assignment: Machine learning model to forecast death rates for the years 2025, 2026, and 2027
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import re
# Load CSV
df = pd.read_csv("datasets/ALD_Data_Big.csv") # replace with your file path
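# Keep only rows related to deaths (.copy() so the new columns below
# are added to an independent DataFrame, avoiding SettingWithCopyWarning)
death_rows = df[df["Metric"].str.contains("deaths", case=False, na=False)].copy()
# Extract the first four-digit year; a range such as "2016-2020" yields 2016,
# and text without a year yields None
def extract_year(x):
    match = re.findall(r'\d{4}', str(x))
    return int(match[0]) if match else None
death_rows["Year_Num"] = death_rows["Year"].apply(extract_year)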
# Convert Value to numeric (non-convertible become NaN)
death_rows["Death_Value"] = pd.to_numeric(death_rows["Value"], errors="coerce")
# Remove rows with NaN in year or death value
death_rows = death_rows.dropna(subset=["Year_Num", "Death_Value"])
# Group by year and average duplicates
clean_data = death_rows.groupby("Year_Num")["Death_Value"].mean().reset_index()
clean_data = clean_data.rename(columns={"Year_Num": "Year", "Death_Value": "ALD_Deaths"})
print("Cleaned Dataset:")
print(clean_data)
# Prepare features and target
X = clean_data[["Year"]]
y = clean_data["ALD_Deaths"]
# Train Linear Regression model
model = LinearRegression()
model.fit(X, y)
# Forecast for 2025–2027 (a DataFrame with the same "Year" column as X
# avoids sklearn's "X does not have valid feature names" warning)
future_years = pd.DataFrame({"Year": [2025, 2026, 2027]})
forecast = model.predict(future_years)
forecast_df = pd.DataFrame({
    "Year": [2025, 2026, 2027],
    "Predicted_ALD_Deaths": forecast.round(1)
})
print("\nForecasted ALD Deaths:")
print(forecast_df)
Cleaned Dataset:
   Year  ALD_Deaths
0  2016       189.2
1  2020       168.4
2  2021       138.8
3  2023       129.4

Forecasted ALD Deaths:
   Year  Predicted_ALD_Deaths
0  2025                 112.3
1  2026                 103.4
2  2027                  94.6
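With a single feature, the line that `LinearRegression` fits can be checked by hand with the closed-form least-squares formulas. The sketch below re-enters the four cleaned values printed above (so it is only as precise as that rounded printout) and recomputes the forecast; the variable names are illustrative.

```python
import numpy as np

# Cleaned (Year, ALD_Deaths) pairs copied from the printed output above
years = np.array([2016, 2020, 2021, 2023], dtype=float)
deaths = np.array([189.2, 168.4, 138.8, 129.4])

# Closed-form simple linear regression:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
x_mean, y_mean = years.mean(), deaths.mean()
slope = np.sum((years - x_mean) * (deaths - y_mean)) / np.sum((years - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Evaluate the fitted line at the forecast years
future = np.array([2025, 2026, 2027], dtype=float)
manual_forecast = intercept + slope * future

print(f"slope = {slope:.2f} deaths per year")
print(manual_forecast.round(1))
```

The slope comes out at about -8.8 deaths per year, and the three values round to the same 112.3, 103.4, and 94.6 reported by the model.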
Fit a machine learning model to the dataset
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import re
# Step 1: Load CSV
df = pd.read_csv("datasets/ALD_Data_Big.csv") # replace with your CSV file path
# Step 2: Filter only rows related to deaths (.copy() avoids SettingWithCopyWarning
# when new columns are added below)
death_rows = df[df["Metric"].str.contains("deaths", case=False, na=False)].copy()
# Step 3: Extract the first four-digit year (None if no year is present)
def extract_year(x):
    match = re.findall(r'\d{4}', str(x))
    return int(match[0]) if match else None
death_rows["Year_Num"] = death_rows["Year"].apply(extract_year)
# Step 4: Convert Value to numeric
death_rows["Death_Value"] = pd.to_numeric(death_rows["Value"], errors="coerce")
# Step 5: Remove rows with NaN in year or value
death_rows = death_rows.dropna(subset=["Year_Num", "Death_Value"])
# Step 6: Group by year and average duplicates
clean_data = death_rows.groupby("Year_Num")["Death_Value"].mean().reset_index()
clean_data = clean_data.rename(columns={"Year_Num": "Year", "Death_Value": "ALD_Deaths"})
# Step 7: Prepare data for model
X = clean_data[["Year"]]
y = clean_data["ALD_Deaths"]
# Step 8: Train Linear Regression model
model = LinearRegression()
model.fit(X, y)
# Step 9: Forecast future years (DataFrames with a "Year" column match the
# feature name used in fit() and avoid sklearn's feature-name warning)
future_years = pd.DataFrame({"Year": [2025, 2026, 2027]})
forecast = model.predict(future_years)
# Combine historical + forecast years to draw the fitted line across both
plot_years = pd.DataFrame({"Year": np.concatenate([X["Year"].values, future_years["Year"].values])})
# Step 10: Plot
plt.figure(figsize=(10, 6))
plt.scatter(X["Year"], y, color='blue', label="Historical ALD Deaths")
plt.plot(plot_years["Year"], model.predict(plot_years), color='red', label="Linear Fit & Forecast")
plt.scatter(future_years["Year"], forecast, color='green', label="Forecast (2025-2027)", marker='x', s=100)
plt.xlabel("Year")
plt.ylabel("ALD Deaths")
plt.title("ALD Deaths Trend and Forecast (Linear Regression)")
plt.legend()
plt.grid(True)
plt.show()
# Step 11: Show forecast table
forecast_df = pd.DataFrame({
    "Year": [2025, 2026, 2027],
    "Predicted_ALD_Deaths": forecast.round(1)
})
print("Forecasted ALD Deaths:")
print(forecast_df)
Forecasted ALD Deaths:
   Year  Predicted_ALD_Deaths
0  2025                 112.3
1  2026                 103.4
2  2027                  94.6
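Because the fit rests on only four yearly values, it is worth checking how well the line describes them before relying on the extrapolation. A minimal sketch, assuming the four cleaned (Year, ALD_Deaths) values printed in the earlier cell and using NumPy's `polyfit` in place of scikit-learn (both fit the same least-squares line):

```python
import numpy as np

# Cleaned values as printed earlier: 2016, 2020, 2021, 2023
years = np.array([2016, 2020, 2021, 2023], dtype=float)
deaths = np.array([189.2, 168.4, 138.8, 129.4])

# Centre the years before fitting to keep the problem well conditioned
x = years - years.mean()
slope, intercept = np.polyfit(x, deaths, deg=1)  # highest degree first

fitted = intercept + slope * x
residuals = deaths - fitted

# Coefficient of determination on the training points
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((deaths - deaths.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("Residuals:", residuals.round(1))
print(f"R^2 on the training points: {r2:.2f}")
```

An R² of roughly 0.9 on four points says little about predictive accuracy, since there is no held-out data and a straight line with a negative slope will eventually predict negative deaths. The 2025 to 2027 figures are best read as a trend extrapolation.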