Gesar Gurung - Fab Futures - Data Science

Lesson 5: Probability Distribution

Assignment:

  • Investigate the probability distribution of your data
  • Set up template notebooks and slides for your data set analysis

Step 1: Load and Clean Data

  1. Replace "-" placeholders with NaN
  2. Remove commas and convert values to numeric
  3. Select only the numeric variables of interest, e.g., Production (MT) across fruits
  4. Handle missing data via exclusion or imputation (we'll exclude for now to avoid bias; an imputation sketch follows the cleaning cell)
In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis

# Load raw data
df = pd.read_csv("datasets/Final_report_tables_2021AS.csv", header=[0, 1], index_col=0)

# Fix multi-index column names
df.columns = [f"{col[0].strip()}_{col[1].strip()}" for col in df.columns]

# Replace cells that contain only a dash (e.g., " -   ") with NaN
df.replace(to_replace=r"^\s*-\s*$", value=np.nan, regex=True, inplace=True)

# Function to clean numeric strings (e.g., "1,41,068" -> 141068)
def clean_number(x):
    if pd.isna(x):
        return x
    return float(str(x).replace(',', ''))

# Apply cleaning to all columns (DataFrame.map requires pandas >= 2.1; use applymap on older versions)
df_clean = df.map(clean_number)

# Display cleaned shape
print("Cleaned DataFrame shape:", df_clean.shape)
Cleaned DataFrame shape: (20, 75)
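As noted in step 4 above, imputation is an alternative to excluding missing values. A minimal sketch of per-column median imputation (not used in the rest of this notebook) might look like this:

# Hedged alternative: fill missing values with each column's median instead of excluding them.
# This keeps every Dzongkhag row, but can bias the distribution toward typical (small) values.
df_imputed = df_clean.fillna(df_clean.median(numeric_only=True))
print("Remaining NaNs after imputation:", df_imputed.isna().sum().sum())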
Step 2: Select Production Columns

In [12]:
# Select only Production (MT) columns
production_cols = [col for col in df_clean.columns if "Production (MT)" in col]
production_df = df_clean[production_cols].copy()

print(f"Extracted {len(production_cols)} production columns")
print("Production data shape:", production_df.shape)
Extracted 25 production columns
Production data shape: (20, 25)

Step 3: Statistical Summary

  1. Plot global histogram of all non-missing production values
  2. Fit a Gaussian (check if CLT-like behavior emerges)
  3. Compute entropy to quantify information/uncertainty

Many agricultural datasets:

  1. Are zero-inflated (many regions don’t grow certain crops)
  2. Have long tails (a few regions dominate production)
  3. Are not Gaussian

We will test this visually and via descriptive statistics (skewness, kurtosis), with a formal normality check added after the summary statistics.

In [13]:
# Flatten to 1D array (ignore Dzongkhag/fruit structure for global distribution)
prod_values = production_df.values.flatten()
prod_values = prod_values[~np.isnan(prod_values)]

print(f"Total valid production entries: {len(prod_values)}")
mean_prod = np.mean(prod_values)
var_prod = np.var(prod_values)
std_prod = np.std(prod_values)
median_prod = np.median(prod_values)
skew_prod = skew(prod_values)
kurt_prod = kurtosis(prod_values)

print("=== Frequentist Statistics ===")
print(f"Mean:       {mean_prod:.2f} MT")
print(f"Median:     {median_prod:.2f} MT")
print(f"Std Dev:    {std_prod:.2f} MT")
print(f"Variance:   {var_prod:.2f} MT²")
print(f"Skewness:   {skew_prod:.2f}")
print(f"Kurtosis:   {kurt_prod:.2f}")
Total valid production entries: 395
=== Frequentist Statistics ===
Mean:       122.39 MT
Median:     6.87 MT
Std Dev:    714.91 MT
Variance:   511102.61 MT²
Skewness:   12.00
Kurtosis:   169.19
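
The skewness and kurtosis values already point to a strongly non-Gaussian shape. As a cross-check (not part of the original assignment), a formal test such as scipy.stats.normaltest can be applied to the same values:

from scipy.stats import normaltest

# D'Agostino-Pearson test: the null hypothesis is that the sample is drawn from a normal distribution
stat, p_value = normaltest(prod_values)
print(f"normaltest statistic: {stat:.2f}, p-value: {p_value:.3g}")
if p_value < 0.05:
    print("Reject normality at the 5% level, consistent with the heavy skew seen above.")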

Step 4: Plot Histogram + Fit Gaussian (for comparison)

In [7]:
plt.figure(figsize=(10, 6))

# Histogram
count, bins, _ = plt.hist(prod_values, bins=50, density=True, alpha=0.6, color='skyblue', edgecolor='black', label='Observed')

# Fit Gaussian with same mean/std
x = np.linspace(0, np.max(prod_values), 500)
gaussian_pdf = (1 / (np.sqrt(2 * np.pi) * std_prod)) * np.exp(-0.5 * ((x - mean_prod) / std_prod) ** 2)
plt.plot(x, gaussian_pdf, 'r--', linewidth=2, label='Fitted Gaussian')

plt.xlabel('Production (MT)')
plt.ylabel('Density')
plt.title('Global Distribution of Fruit Production (MT) in Bhutan')
plt.legend()
plt.yscale('log')  # Better visibility for long tail
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()
[Figure: Global Distribution of Fruit Production (MT) in Bhutan; observed histogram with fitted Gaussian overlay on a log-scaled density axis]

Histogram shows:

  1. Spike near zero (many small or zero productions)
  2. Long tail to the right (e.g., Banana, Mandarin, Areca nut in southern Dzongkhags)

The Gaussian fit is poor: the data is clearly not normally distributed.
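
To put a number on the spike near zero (the 1 MT cut-off below is an arbitrary choice for illustration):

# Fraction of entries that are exactly zero or very small (below an assumed 1 MT cut-off)
frac_zero = np.mean(prod_values == 0)
frac_small = np.mean(prod_values < 1)
print(f"Exactly zero:  {frac_zero:.1%}")
print(f"Below 1 MT:    {frac_small:.1%}")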

Step 5: Compute Entropy

In [20]:
def entropy_from_samples(samples, bins=256):
    hist, _ = np.histogram(samples, bins=bins, density=False)
    prob = hist / np.sum(hist)
    prob = prob[prob > 0]  # Avoid log(0)
    H = -np.sum(prob * np.log2(prob))
    return H

entropy_val = entropy_from_samples(prod_values, bins=256)
max_entropy = np.log2(256)  # = 8 bits

print(f"\n=== Entropy ===")
print(f"Entropy:      {entropy_val:.2f} bits")
print(f"Max entropy:  {max_entropy:.1f} bits (uniform over 256 bins)")
print(f"Interpretation: Moderate uncertainty — distribution is concentrated but not deterministic.")
=== Entropy ===
Entropy:      1.26 bits
Max entropy:  8.0 bits (uniform over 256 bins)
Interpretation: Moderate uncertainty — distribution is concentrated but not deterministic.
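
As a sanity check on the hand-rolled estimator above (an added cross-check, not from the original notebook), scipy.stats.entropy returns the same quantity when given the same histogram counts:

from scipy.stats import entropy

# Build the same 256-bin histogram and let scipy compute Shannon entropy in bits
hist, _ = np.histogram(prod_values, bins=256)
H_scipy = entropy(hist / hist.sum(), base=2)
print(f"scipy entropy: {H_scipy:.2f} bits")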

Step 6: Identify Long-Tail Behavior (Top Producers)

In [16]:
# Find top 10 production entries
top_productions = production_df.stack().dropna().sort_values(ascending=False).head(10)

print("\n=== Top 10 Production Entries (Long Tail Evidence) ===")
print(top_productions)
=== Top 10 Production Entries (Long Tail Evidence) ===
Samtse            Unnamed: 6_level_0_Production (MT)    11393.46
Sarpang           Unnamed: 6_level_0_Production (MT)     6350.19
Dagana            Unnamed: 9_level_0_Production (MT)     2791.97
Samdrup Jongkhar  Unnamed: 9_level_0_Production (MT)     2377.96
Tsirang           Unnamed: 9_level_0_Production (MT)     2147.73
Zhemgang          Unnamed: 9_level_0_Production (MT)     1923.52
Sarpang           Unnamed: 9_level_0_Production (MT)     1772.16
Samdrup Jongkhar  Unnamed: 6_level_0_Production (MT)     1604.64
Paro              Unnamed: 3_level_0_Production (MT)     1510.91
Chukha            Unnamed: 9_level_0_Production (MT)     1353.37
dtype: float64

Areca nut and Mandarin dominate total output, clear evidence of heavy-tailed crop dominance. (The "Unnamed" column labels above are an artifact of merged header cells in the source table.)
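
One way to quantify this dominance is the share of national output captured by the top entries (an added check using the same production_df):

# Share of total recorded production contributed by the top 10 Dzongkhag x crop entries
total_production = production_df.sum().sum()
top10_share = top_productions.sum() / total_production
print(f"Top 10 entries account for {top10_share:.1%} of total recorded production")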

Observations:

  1. Uncertainty is high in absolute terms (large std dev), but structured:
     a. Most Dzongkhags produce very little of most fruits
     b. A few crops × regions dominate national output
  2. Distribution is long-tailed and non-Gaussian:
     a. A Gaussian assumption would misrepresent risk/variability
     b. Better modeled with log-normal, Pareto, or zero-inflated models (see the fit sketch after the recommendations)
  3. Entropy confirms moderate information content:
     a. Not random, but not predictable without regional/crop context

Recommendation:

  1. Use density estimation or Bayesian modeling with priors for forecasting
  2. Consider log-transform before applying Gaussian-based methods (e.g., PCA, regression)
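
As a sketch of the modeling suggestion above, a log-normal distribution can be fitted to the strictly positive production values with scipy.stats.lognorm (fixing loc=0 is an assumption made here for illustration):

from scipy.stats import lognorm

# Fit a log-normal to the strictly positive values, fixing the location parameter at zero
positive = prod_values[prod_values > 0]
shape, loc, scale = lognorm.fit(positive, floc=0)
print(f"Fitted log-normal: shape (sigma) = {shape:.2f}, scale (exp(mu)) = {scale:.2f} MT")
print(f"Implied median: {lognorm.median(shape, loc=loc, scale=scale):.2f} MT "
      f"(observed median of positive values: {np.median(positive):.2f} MT)")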

Log-Transform to Visualize Skew Reduction

In [17]:
log_prod = np.log1p(prod_values)  # log(1 + x) to handle zeros

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(prod_values, bins=50, density=True, alpha=0.7)
plt.title('Raw Production')
plt.xlabel('MT')
plt.yscale('log')

plt.subplot(1, 2, 2)
plt.hist(log_prod, bins=50, density=True, alpha=0.7, color='orange')
plt.title('Log1p-Transformed Production')
plt.xlabel('log(1 + MT)')

plt.tight_layout()
plt.show()
[Figure: side-by-side histograms of raw production (log-scaled density axis) and log1p-transformed production]
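
A quick numeric check of the skew reduction (added for illustration; it reuses the skew function imported earlier):

# Compare skewness before and after the log1p transform
print(f"Skewness (raw):   {skew(prod_values):.2f}")
print(f"Skewness (log1p): {skew(log_prod):.2f}")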