[Drukdra Dorji] - Fab Futures - Data Science

Week 6: Fitting a Probability Distribution (05 December 2025)¶

The session explains how Expectation–Maximization (E-M) is used to estimate probability distributions when some variables are hidden or latent. E-M alternates between updating the hidden variables using the current model and updating the model using the estimated latent variables, gradually increasing the likelihood with each iteration. This process underlies clustering techniques such as k-means, where data points are assigned to the nearest anchor point and anchors are updated to the mean of their assigned points. Concepts like Voronoi tessellation help visualize how space is divided according to anchor locations.

The session also introduces Gaussian Mixture Models (GMMs), a soft clustering approach in which each cluster is represented by a Gaussian distribution. Here the means, variances, and mixture weights are updated through conditional distributions, capturing complex data densities and allowing low-probability clusters to be pruned.

Finally, the session covers Cluster-Weighted Modeling (CWM), also described as a mixture-of-experts or Bayesian network–based model. In this approach, each cluster includes not only a probability weight and an input-space influence but also a functional dependency in the output space, enabling more expressive models. The mean of each Gaussian becomes an input-dependent function governed by a set of parameters that are updated, often through local linear models. CWM generates conditional forecasts, computes forecast errors, and updates local coefficients to better fit the data. Ultimately, these methods produce a table of state probabilities that captures how data is distributed across clusters and how functional relationships vary across different regions of the input space.
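The k-means procedure described above can be sketched in a few lines of NumPy: the E-step assigns each point to its nearest anchor, and the M-step moves each anchor to the mean of its assigned points. This is a minimal illustration on synthetic two-blob data, not the code used later for the dataset.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means: alternate an E-step (assign points to the
    nearest anchor) and an M-step (move anchors to cluster means)."""
    rng = np.random.default_rng(seed)
    # Initialize anchors at k randomly chosen data points
    anchors = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # E-step: distance from every point to every anchor, pick nearest
        dists = np.linalg.norm(points[:, None] - anchors[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # M-step: each anchor becomes the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                anchors[j] = points[labels == j].mean(axis=0)
    return anchors, labels

# Two well-separated synthetic blobs, around (0, 0) and (10, 10)
pts = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
anchors, labels = kmeans(pts, k=2)
```

After convergence the anchors sit at the two blob centers, and the boundary between the assignment regions is exactly the Voronoi boundary between the anchors.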

Assignment: We are asked to fit a probability distribution to our dataset¶

Compiled Dataset: Alcohol-Related Deaths / Burden in Bhutan¶

Introduction to the Dataset¶

This dataset presents a compiled summary of alcohol-related deaths and alcohol-attributable health indicators in Bhutan, drawn from publicly available national and international sources. The data combines information from the Ministry of Health’s Annual Health Bulletins, the National Statistics Bureau’s Vital Statistics Reports, WHO country profiles, and published research such as the Bhutan Health Journal. It includes annual figures on alcohol-related liver disease (ALD) deaths, the proportion of deaths attributed to alcohol in health facilities, trends across multiple years, and population-level alcohol-consumption indicators. The dataset is designed to provide a clear picture of how alcohol contributes to mortality and public health challenges in Bhutan, enabling further analysis, comparison, and interpretation for academic or policy-related purposes.

Fit a Probability Distribution to the Dataset¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
import re

# ------------------------------------------------------
# 1. Load CSV file
# ------------------------------------------------------
df = pd.read_csv('datasets/ALD_Data_Big.csv')  # replace with your CSV filename

# ------------------------------------------------------
# 2. Convert 'Value' column to numeric
# ------------------------------------------------------
def extract_numeric(value):
    """Convert string values to numeric, handling %, ~, and ranges."""
    if pd.isnull(value):
        return None
    value = str(value).replace('%', '').replace('~', '').replace('→', '-')
    if '-' in value:
        parts = value.split('-')
        try:
            nums = [float(p.strip()) for p in parts if re.search(r'\d', p)]
            return sum(nums) / len(nums) if nums else None
        except (ValueError, TypeError):
            return None
    try:
        return float(value)
    except (ValueError, TypeError):
        return None

df['NumericValue'] = df['Value'].apply(extract_numeric)
data = df['NumericValue'].dropna().values.reshape(-1, 1)

# ------------------------------------------------------
# 3. Fit Gaussian Mixture Model (E-M)
# ------------------------------------------------------
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(data)

# Extract parameters
means = gmm.means_.flatten()
variances = gmm.covariances_.flatten()
weights = gmm.weights_.flatten()

# ------------------------------------------------------
# 4. Plot the data and fitted distributions
# ------------------------------------------------------
x = np.linspace(data.min() - 10, data.max() + 10, 1000).reshape(-1, 1)
logprob = gmm.score_samples(x)
pdf = np.exp(logprob)

plt.figure(figsize=(10,6))
plt.hist(data, bins=20, density=True, alpha=0.5, color='skyblue', label='Data histogram')
plt.plot(x, pdf, color='red', lw=2, label='GMM probability density')

# Plot individual Gaussian components
for mean, var, weight in zip(means, variances, weights):
    component_pdf = weight * (1/np.sqrt(2*np.pi*var)) * np.exp(-0.5*((x-mean)**2)/var)
    plt.plot(x, component_pdf, lw=2, linestyle='--', label=f'Component μ={mean:.2f}')

plt.title('Gaussian Mixture Model Fit (E-M)')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

Explanation¶

The graph displays the dataset's distribution as a histogram (sky-blue bars) overlaid with the Gaussian Mixture Model (GMM) fit. The red curve is the overall probability density estimated by the E-M algorithm, while the dashed curves show the individual Gaussian components corresponding to distinct clusters: small values (mean 6.2) for ALD percentages and incidence changes, mid-range ALD deaths (mean 134.6), and high ALD deaths (mean 179.8). Peaks indicate the most probable value ranges, and their relative heights reflect each cluster's contribution to the total density. The visualization shows that the data is multi-modal, with three natural groups, allowing clear identification of low-, medium-, and high-value clusters, which can be used for understanding trends, detecting anomalies, or probabilistic forecasting.
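The fit above fixes `n_components=3` by hand; a common follow-up is to let the data choose the number of components via the Bayesian Information Criterion (BIC), and then read off each point's soft cluster memberships. Since the CSV file is not reproduced here, this sketch uses synthetic data drawn around the three cluster means reported above (6.2, 134.6, 179.8) as a stand-in for the `NumericValue` column.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the three value ranges described above
# (the real 'NumericValue' column is not reproduced here)
rng = np.random.default_rng(42)
data = np.concatenate([
    rng.normal(6, 2, 60),     # small values (percentages, changes)
    rng.normal(135, 8, 40),   # mid-range ALD deaths
    rng.normal(180, 6, 40),   # high ALD deaths
]).reshape(-1, 1)

# Fit GMMs with 1..5 components and pick the one with the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=42).fit(data).bic(data)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)

# Soft cluster memberships: each row of predict_proba sums to 1
gmm = GaussianMixture(n_components=best_k, random_state=42).fit(data)
resp = gmm.predict_proba(data)
```

On well-separated data like this, BIC recovers the three-cluster structure, and `predict_proba` gives the per-point responsibilities that the E-M E-step computes internally, which is what makes GMM a soft rather than hard clustering.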
