SUVENDRAN CITHIVEL - Fab Futures - Data Science

Probability Distribution Analysis of the ICRISAT Dataset

The ICRISAT District-Level Data dataset (16,146 rows, 80 columns) contains numerical crop metrics: areas in 1000 ha, production in 1000 tons, and yields in kg/ha. These metrics are typically right-skewed because of agricultural variability (e.g., small farms dominate, with outliers from large or irrigated areas). Missing values, encoded as -1.0, are replaced with NaN and then either dropped or imputed during analysis. Distributions are often zero-inflated (many zeros for minor crops) and non-normal, which suggests a log transform as a preprocessing step for ML.
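To illustrate why a log transform is suggested for zero-inflated, right-skewed columns, here is a minimal sketch on synthetic data. The array `area` is a hypothetical stand-in (a block of zeros plus lognormal values), not a column from the ICRISAT file, and `np.log1p` is one common choice of transform rather than necessarily the one used later in this analysis.

```python
import numpy as np
from scipy.stats import skew

# Synthetic stand-in for a zero-inflated, right-skewed crop column
# (hypothetical values, not taken from the ICRISAT file).
rng = np.random.default_rng(0)
positive = rng.lognormal(mean=4.0, sigma=1.0, size=900)
area = np.concatenate([np.zeros(100), positive])

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
area_log = np.log1p(area)

skew_raw = skew(area)
skew_log = skew(area_log)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

The transform shrinks the long right tail, but the zeros stay as a spike at zero, so the result is less skewed rather than normal. Zero-inflated columns may still need separate handling of the zero and non-zero parts.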

Step 1: Load Data and Compute Summary Statistics

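This code loads the CSV, replaces -1 with NaN, identifies numerical columns, and computes describe() statistics plus skewness and excess kurtosis (scipy's kurtosis uses the Fisher definition by default, so a normal distribution scores 0).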

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis

# Load data
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')

# Replace -1 with NaN
df.replace(-1, np.nan, inplace=True)

# Numerical columns
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Summary stats with skew/kurt (drop NaNs)
summary = df[num_cols].describe().T
summary['skew'] = df[num_cols].apply(skew, nan_policy='omit')
summary['kurt'] = df[num_cols].apply(kurtosis, nan_policy='omit')

# Display for key columns (e.g., rice area, wheat/maize yields)
key_cols = ['RICE AREA (1000 ha)', 'WHEAT YIELD (Kg per ha)', 'MAIZE YIELD (Kg per ha)']
print(summary.loc[key_cols])
                           count         mean          std  min     25%  \
RICE AREA (1000 ha)      16124.0   128.770012   160.116360  0.0   10.50   
WHEAT YIELD (Kg per ha)  16111.0  1495.664207  1080.183846  0.0  755.74   
MAIZE YIELD (Kg per ha)  16118.0  1411.212242  1191.278525  0.0  700.00   

                              50%       75%       max      skew       kurt  
RICE AREA (1000 ha)        67.100   191.805   1154.23  1.958608   4.989687  
WHEAT YIELD (Kg per ha)  1350.030  2133.160   5541.52  0.651656   0.172352  
MAIZE YIELD (Kg per ha)  1162.075  1864.860  21428.57  2.248494  11.763736  

Insights:

  • Skewness (>0): All positive, indicating right-skew (long tails for high values, e.g., advanced farming districts).
  • Kurtosis: >0 for rice/maize (leptokurtic: heavy tails, outliers); ~0 for wheat (tails close to those of a normal distribution, although its skew of 0.65 means it is still not symmetric).
  • Overall: Means > medians; ~10-20% zeros per crop column (zero-inflated).
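The zero-inflation and mean-versus-median observations above can be checked per column with a few lines of pandas. The sketch below uses a tiny hypothetical frame (`toy`, with made-up column names and values) so that it runs on its own; applying the same expressions to `df[num_cols]` is an assumption about how the figures in the list were obtained.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "CROP_A AREA": [0.0, 0.0, 5.0, 12.0, 300.0, np.nan],
    "CROP_B YIELD": [0.0, 800.0, 950.0, 1100.0, 4000.0, 1200.0],
})

# Share of exact zeros among non-missing values, per column.
zero_share = (toy == 0).sum() / toy.notna().sum()

# A mean above the median is a quick indicator of right skew.
mean_gt_median = toy.mean() > toy.median()

check = pd.DataFrame({"zero_share": zero_share, "mean_gt_median": mean_gt_median})
print(check)
```

Dividing by the non-missing count rather than the row count keeps NaNs (the former -1 values) from diluting the zero share.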

Step 2: Visualize Distributions (Histograms/KDE)

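Sample up to 1,000 rows (to keep plotting light) and plot probability densities (normalized histograms). For Chart.js rendering, the bin midpoints and heights would have to be computed separately; the cell below only draws the Matplotlib and Seaborn plots.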

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

# Sample for plotting (at most 1000 rows, after dropping NaNs)
clean_df = df[key_cols].dropna()
sample_df = clean_df.sample(n=min(1000, len(clean_df)), random_state=42)

# Plot histograms (density=True for probability)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(key_cols):
    axes[i].hist(sample_df[col], bins=20, density=True, alpha=0.7, color='skyblue')
    axes[i].set_title(f'PDF: {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Probability Density')
plt.tight_layout()
plt.show()

# Alternative: KDE (smoother)
for col in key_cols:
    sns.kdeplot(data=sample_df, x=col)
    plt.title(f'KDE: {col}')
    plt.show()
[Figures: density histograms ("PDF") for RICE AREA, WHEAT YIELD, and MAIZE YIELD, followed by one KDE plot per column]
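Step 2 mentions bin midpoints and heights for Chart.js rendering, which the plotting cell does not produce. A minimal sketch of that computation with `np.histogram` follows. The `values` array is a hypothetical sample standing in for one column of `sample_df`, and the `{"x": ..., "y": ...}` point layout is an assumption about what the charting layer expects, not something taken from this notebook.

```python
import numpy as np

# Hypothetical sample standing in for one column of sample_df.
rng = np.random.default_rng(42)
values = rng.gamma(shape=2.0, scale=700.0, size=1000)

# density=True scales the heights so that the bar areas sum to 1,
# matching the density=True histograms plotted with Matplotlib.
heights, edges = np.histogram(values, bins=20, density=True)

# Midpoint of each bin: average of its left and right edge.
mids = 0.5 * (edges[:-1] + edges[1:])

# Plain Python floats serialize directly to JSON for a front-end chart.
chart_points = [{"x": float(m), "y": float(h)} for m, h in zip(mids, heights)]
print(chart_points[:3])
```

Using the same `bins=20` and `density=True` as the Matplotlib call keeps the exported points consistent with the histograms drawn above.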