< Home
Lesson 5: Probability distribution¶
Assignment:
- Investigate the probability distribution of your data
- Set up template notebooks and slides for your data set analysis
Step 1: Load and Clean Data
- Replace - with NaN
- Remove commas and convert to numeric
- Select only numeric variables of interest, e.g., Production (MT) across fruits
- Handle missing data via exclusion or imputation (we’ll exclude for now to avoid bias)
In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis
# Load raw data
df = pd.read_csv("datasets/Final_report_tables_2021AS.csv", header=[0, 1], index_col=0)
# Fix multi-index column names
df.columns = [f"{col[0].strip()}_{col[1].strip()}" for col in df.columns]
# Replace " - " with NaN and remove commas from numbers
df.replace(to_replace=r"-\s*", value=np.nan, regex=True, inplace=True)
# Function to clean numeric strings (e.g., "1,41,068" -> 141068)
def clean_number(x):
if pd.isna(x):
return x
return float(str(x).replace(',', ''))
# Apply cleaning to all columns
df_clean = df.map(clean_number)
# Display cleaned shape
print("Cleaned DataFrame shape:", df_clean.shape)
Cleaned DataFrame shape: (20, 75)
In [12]:
# Select only Production (MT) columns
production_cols = [col for col in df_clean.columns if "Production (MT)" in col]
production_df = df_clean[production_cols].copy()
print(f"Extracted {len(production_cols)} production columns")
print("Production data shape:", production_df.shape)
Extracted 25 production columns Production data shape: (20, 25)
Step 3: Statistical Summary
- Plot global histogram of all non-missing production values
- Fit a Gaussian (check if CLT-like behavior emerges)
- Compute entropy to quantify information/uncertainty
Many agricultural datasets:
- Are zero-inflated (many regions don’t grow certain crops)
- Have long tails (a few regions dominate production)
- Are not Gaussian
We will test visually and via descriptive stats (skewness, kurtosis).
In [13]:
# Flatten to 1D array (ignore Dzongkhag/fruit structure for global distribution)
prod_values = production_df.values.flatten()
prod_values = prod_values[~np.isnan(prod_values)]
print(f"Total valid production entries: {len(prod_values)}")
mean_prod = np.mean(prod_values)
var_prod = np.var(prod_values)
std_prod = np.std(prod_values)
median_prod = np.median(prod_values)
skew_prod = skew(prod_values)
kurt_prod = kurtosis(prod_values)
print("=== Frequentist Statistics ===")
print(f"Mean: {mean_prod:.2f} MT")
print(f"Median: {median_prod:.2f} MT")
print(f"Std Dev: {std_prod:.2f} MT")
print(f"Variance: {var_prod:.2f} MT²")
print(f"Skewness: {skew_prod:.2f}")
print(f"Kurtosis: {kurt_prod:.2f}")
Total valid production entries: 395 === Frequentist Statistics === Mean: 122.39 MT Median: 6.87 MT Std Dev: 714.91 MT Variance: 511102.61 MT² Skewness: 12.00 Kurtosis: 169.19
Step 4: Plot Histogram + Fit Gaussian (for comparison)
In [7]:
plt.figure(figsize=(10, 6))
# Histogram
count, bins, _ = plt.hist(prod_values, bins=50, density=True, alpha=0.6, color='skyblue', edgecolor='black', label='Observed')
# Fit Gaussian with same mean/std
x = np.linspace(0, np.max(prod_values), 500)
gaussian_pdf = (1 / (np.sqrt(2 * np.pi) * std_prod)) * np.exp(-0.5 * ((x - mean_prod) / std_prod) ** 2)
plt.plot(x, gaussian_pdf, 'r--', linewidth=2, label='Fitted Gaussian')
plt.xlabel('Production (MT)')
plt.ylabel('Density')
plt.title('Global Distribution of Fruit Production (MT) in Bhutan')
plt.legend()
plt.yscale('log') # Better visibility for long tail
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()
Histogram shows:
- Spike near zero (many small or zero productions)
- Long tail to the right (e.g., Banana, Mandarin, Areca nut in southern Dzongkhags)
Gaussian fit is poor — data is not normally distributed
In [20]:
def entropy_from_samples(samples, bins=256):
hist, _ = np.histogram(samples, bins=bins, density=False)
prob = hist / np.sum(hist)
prob = prob[prob > 0] # Avoid log(0)
H = -np.sum(prob * np.log2(prob))
return H
entropy_val = entropy_from_samples(prod_values, bins=256)
max_entropy = np.log2(256) # = 8 bits
print(f"\n=== Entropy ===")
print(f"Entropy: {entropy_val:.2f} bits")
print(f"Max entropy: {max_entropy:.1f} bits (uniform over 256 bins)")
print(f"Interpretation: Moderate uncertainty — distribution is concentrated but not deterministic.")
=== Entropy === Entropy: 1.26 bits Max entropy: 8.0 bits (uniform over 256 bins) Interpretation: Moderate uncertainty — distribution is concentrated but not deterministic.
Step 6: Identify Long-Tail Behavior (Top Producers)
In [16]:
# Find top 10 production entries
top_productions = production_df.stack().dropna().sort_values(ascending=False).head(10)
print("\n=== Top 10 Production Entries (Long Tail Evidence) ===")
print(top_productions)
=== Top 10 Production Entries (Long Tail Evidence) === Samtse Unnamed: 6_level_0_Production (MT) 11393.46 Sarpang Unnamed: 6_level_0_Production (MT) 6350.19 Dagana Unnamed: 9_level_0_Production (MT) 2791.97 Samdrup Jongkhar Unnamed: 9_level_0_Production (MT) 2377.96 Tsirang Unnamed: 9_level_0_Production (MT) 2147.73 Zhemgang Unnamed: 9_level_0_Production (MT) 1923.52 Sarpang Unnamed: 9_level_0_Production (MT) 1772.16 Samdrup Jongkhar Unnamed: 6_level_0_Production (MT) 1604.64 Paro Unnamed: 3_level_0_Production (MT) 1510.91 Chukha Unnamed: 9_level_0_Production (MT) 1353.37 dtype: float64
Areca nut and Mandarin dominate total output — heavy-tailed crop dominance.
Observations:
- Uncertainty is high in absolute terms (large std dev), but structured: a. Most Dzongkhags produce very little of most fruits b. A few crops × regions dominate national output
- Distribution is long-tailed and non-Gaussian: a. Gaussian assumption would misrepresent risk/variability b. Better modeled with log-normal, Pareto, or zero-inflated models
- Entropy confirms moderate information content: a. Not random, but not predictable without regional/crop context
Recommendation:
- Use density estimation or Bayesian modeling with priors for forecasting
- Consider log-transform before applying Gaussian-based methods (e.g., PCA, regression)
Log-Transform to Visualize Skew Reduction
In [17]:
log_prod = np.log1p(prod_values) # log(1 + x) to handle zeros
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(prod_values, bins=50, density=True, alpha=0.7)
plt.title('Raw Production')
plt.xlabel('MT')
plt.yscale('log')
plt.subplot(1, 2, 2)
plt.hist(log_prod, bins=50, density=True, alpha=0.7, color='orange')
plt.title('Log1p-Transformed Production')
plt.xlabel('log(1 + MT)')
plt.tight_layout()
plt.show()
In [ ]: