SUVENDRAN CITHIVEL - Fab Futures - Data Science

Week 3: Machine Learning

Machine learning (ML) is a core pillar of data science, enabling computers to learn patterns from data and make predictions or decisions without being explicitly programmed for every scenario. In essence, it's the "brain" that turns raw data into actionable insights, automating tasks like forecasting sales, detecting fraud, or recommending products.

Using Scikit-Learn for Classification on the ICRISAT Dataset

This dataset contains district-level agricultural data from 1966–2017 across 20 Indian states, with 16,146 rows and 80 columns (5 categorical/ID columns plus 75 numerical features covering crop areas, production, and yields). Areas and production figures are reported in units of 1,000 ha and 1,000 tons, and -1.0 marks missing data (treated as NaN during preprocessing). For this example, we'll perform multiclass classification to predict State Name (20 classes) from the numerical crop features, which captures regional agricultural patterns (e.g., rice-dominant West Bengal vs. wheat-dominant Punjab). Classes are imbalanced (Uttar Pradesh: 2,392 samples; Jharkhand: 300), so we'll use a stratified train/test split.

Step 1: Load the Dataset

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')

# Quick exploration
print(f"Shape: {df.shape}")
print(f"Unique states: {df['State Name'].nunique()}")
print(df['State Name'].value_counts().head())  # Top 5 states
print(df.head(2))  # First 2 rows
Shape: (16146, 80)
Unique states: 20
State Name
Uttar Pradesh     2392
Madhya Pradesh    1924
Rajasthan         1352
Maharashtra       1348
Karnataka          988
Name: count, dtype: int64
   Dist Code  Year  State Code    State Name Dist Name  RICE AREA (1000 ha)  \
0          1  1966          14  Chhattisgarh      Durg                548.0   
1          1  1967          14  Chhattisgarh      Durg                547.0   

   RICE PRODUCTION (1000 tons)  RICE YIELD (Kg per ha)  WHEAT AREA (1000 ha)  \
0                        185.0                  337.59                  44.0   
1                        409.0                  747.71                  50.0   

   WHEAT PRODUCTION (1000 tons)  ...  SUGARCANE YIELD (Kg per ha)  \
0                          20.0  ...                      1777.78   
1                          26.0  ...                      1500.00   

   COTTON AREA (1000 ha)  COTTON PRODUCTION (1000 tons)  \
0                    0.0                            0.0   
1                    0.0                            0.0   

   COTTON YIELD (Kg per ha)  FRUITS AREA (1000 ha)  VEGETABLES AREA (1000 ha)  \
0                       0.0                   5.95                       6.64   
1                       0.0                   5.77                       7.24   

   FRUITS AND VEGETABLES AREA (1000 ha)  POTATOES AREA (1000 ha)  \
0                                 12.59                     0.01   
1                                 13.02                     0.01   

   ONION AREA (1000 ha)  FODDER AREA (1000 ha)  
0                  0.60                   0.47  
1                  0.56                   1.23  

[2 rows x 80 columns]
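
Since the -1.0 sentinel and the class imbalance drive the preprocessing choices, a quick check on the loaded frame confirms both. A minimal sketch, reusing df from the cell above:

# How many cells per numeric column use the -1.0 missing-data sentinel (top 5 columns)
num = df.select_dtypes(include='number')
print((num == -1.0).sum().sort_values(ascending=False).head())

# Smallest classes, to show the imbalance that motivates stratified splitting
print(df['State Name'].value_counts().sort_values().head())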

Step 2: Preprocess Data

  • Replace -1.0 with NaN.
  • Select features: All numerical columns (exclude IDs: Dist Code, Year, State Code, State Name, Dist Name).
  • Target: State Name (encode to integers).
  • Impute NaNs with median (robust to outliers).
  • Drop rows with NaN in target (none here).
In [6]:
# Replace -1 with NaN
df.replace(-1, np.nan, inplace=True)

# Define features (numerical columns) and target
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in num_cols if col not in ['Dist Code', 'Year', 'State Code']]
target_col = 'State Name'

X = df[feature_cols]
y = df[target_col]

# Encode target
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Impute missing values
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

print(f"Features shape: {X_imputed.shape}")
print(f"Target classes: {le.classes_[:5]}...")  # First 5 classes
Features shape: (16146, 75)
Target classes: ['Andhra Pradesh' 'Assam' 'Bihar' 'Chhattisgarh' 'Gujarat']...
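
As an aside, the imputer and the classifier used in the next step can be chained in a scikit-learn Pipeline, so the median statistics are learned only on the training portion of each split. A minimal sketch, reusing the un-imputed X and y_encoded from the cell above:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Chain median imputation and the forest; each CV fold fits the imputer on its
# own training portion, so no imputation statistics leak in from the test data.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
])
scores = cross_val_score(pipe, X, y_encoded, cv=5, n_jobs=-1)  # stratified 5-fold by default
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")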

Step 3: Split Data and Train Model

Use a stratified 80/20 split so each state keeps its share of samples in both sets. Train a Random Forest (n_estimators=100 for a quick fit; for production, tune the hyperparameters via GridSearchCV, as sketched after the Explanation below).

In [7]:
# Split (stratified for class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)
y_pred_labels = le.inverse_transform(y_pred)  # Back to state names
y_test_labels = le.inverse_transform(y_test)

print("Training complete!")
Training complete!

Explanation

Random Forest is an ensemble of decision trees, so it copes well with the multicollinearity among the crop features (areas, productions, and yields are strongly correlated). Fit time is roughly 10–20 s on this dataset.
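
The GridSearchCV tuning mentioned in Step 3 could look roughly like the sketch below; the parameter grid and the macro-F1 scoring are illustrative choices, not values tuned for this dataset:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; kept small on purpose since each forest takes ~10-20 s to fit
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 20],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=3,                 # 3-fold CV keeps the runtime manageable
    scoring='f1_macro',   # macro F1 weights small states (e.g. Jharkhand) equally
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))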

Step 4: Evaluate Model

In [8]:
# Classification report
print(classification_report(y_test, y_pred, target_names=le.classes_))

# Confusion matrix (label every 4th state for readability)
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=le.classes_, columns=le.classes_)  # name the axes so tick labels stay aligned
plt.figure(figsize=(10, 8))
sns.heatmap(cm_df, annot=False, cmap='Blues', xticklabels=4, yticklabels=4)  # integer = show every 4th label
plt.title('Confusion Matrix (Subset Labels)')
plt.xlabel('Predicted State')
plt.ylabel('True State')
plt.show()

# Feature importance (top 10)
importances = pd.DataFrame({'feature': feature_cols, 'importance': clf.feature_importances_})
top_features = importances.sort_values('importance', ascending=False).head(10)
print(top_features)

# Plot top features
sns.barplot(data=top_features, x='importance', y='feature')
plt.title('Top 10 Feature Importances')
plt.show()
                  precision    recall  f1-score   support

  Andhra Pradesh       1.00      1.00      1.00       114
           Assam       1.00      0.99      1.00       104
           Bihar       1.00      1.00      1.00       114
    Chhattisgarh       1.00      1.00      1.00        62
         Gujarat       1.00      1.00      1.00       187
         Haryana       1.00      1.00      1.00        73
Himachal Pradesh       1.00      1.00      1.00       104
       Jharkhand       0.98      1.00      0.99        60
       Karnataka       1.00      1.00      1.00       198
          Kerala       0.99      1.00      1.00       103
  Madhya Pradesh       1.00      1.00      1.00       385
     Maharashtra       1.00      1.00      1.00       270
          Orissa       1.00      1.00      1.00       135
          Punjab       1.00      0.99      1.00       114
       Rajasthan       1.00      1.00      1.00       271
      Tamil Nadu       1.00      0.99      1.00       125
       Telangana       1.00      1.00      1.00        94
   Uttar Pradesh       1.00      1.00      1.00       479
     Uttarakhand       1.00      1.00      1.00        83
     West Bengal       1.00      1.00      1.00       155

        accuracy                           1.00      3230
       macro avg       1.00      1.00      1.00      3230
    weighted avg       1.00      1.00      1.00      3230

[Figure: Confusion Matrix (Subset Labels)]
                                feature  importance
4          WHEAT PRODUCTION (1000 tons)    0.037195
3                  WHEAT AREA (1000 ha)    0.037189
0                   RICE AREA (1000 ha)    0.034349
72              POTATOES AREA (1000 ha)    0.030699
1           RICE PRODUCTION (1000 tons)    0.025169
64     SUGARCANE PRODUCTION (1000 tons)    0.023811
30             PIGEONPEA AREA (1000 ha)    0.023069
31     PIGEONPEA PRODUCTION (1000 tons)    0.022033
65          SUGARCANE YIELD (Kg per ha)    0.021959
16  PEARL MILLET PRODUCTION (1000 tons)    0.020814
[Figure: Top 10 Feature Importances]
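
To reuse the fitted objects for prediction, new crop-feature rows must pass through the same imputer before the classifier. A minimal sketch, using the first test row as a stand-in for unseen data:

# The first test row stands in for an unseen sample; real new data would need
# the same 75 feature columns, passed through imputer.transform() first.
new_sample = X_test[:1]                       # shape (1, 75), already imputed
pred = clf.predict(new_sample)
print("Predicted state:", le.inverse_transform(pred)[0])
print("Actual state:   ", le.inverse_transform(y_test[:1])[0])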