SUVENDRAN CITHIVEL - Fab Futures - Data Science
Home About

Week 3: Data Fitting¶

Understanding the Dataset¶

This is district-level agricultural data from ICRISAT (International Crops Research Institute for the Semi-Arid Tropics), covering crops like rice, wheat, sorghum, millets, pulses, oilseeds, sugarcane, cotton, and more. Key columns include:

  • Dist Code: District identifier.
  • Year: Time period (e.g., 1966–2017, based on samples).
  • State Code/Name, Dist Name: Geographic info.
  • Crop-specific: AREA (in 1000 ha), PRODUCTION (in 1000 tons), YIELD (kg/ha). Yields are often derived as PRODUCTION / AREA, but -1 or -999 indicates missing data.

The dataset spans multiple states/districts (e.g., Chhattisgarh's Durg, Jharkhand's Singhbhum) and years, making it suitable for time series analysis, trend fitting, or predictive modeling. Common "fitting" tasks in data science here include:

  • Trend fitting: Linear/polynomial regression on yields over years.
  • Curve fitting: Exponential or logistic models for growth in production.
  • Regression: Predicting production from area, rainfall (if added), or other features.
  • Time series: ARIMA or Prophet for forecasting yields.

source : kaggle.com

Loading the Dataset¶

1D Polynomial

x = years y = yields

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load and prepare real data (replace path if needed)
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')
df.replace(-1, np.nan, inplace=True)

durg_rice = df[(df['Dist Name'] == 'Durg') & (df['Year'] >= 1966) & (df['Year'] <= 2000)][['Year', 'RICE YIELD (Kg per ha)']].dropna()
durg_rice['Year'] = durg_rice['Year'].astype(int)

x = durg_rice['Year'].values  # x: Years
y = durg_rice['RICE YIELD (Kg per ha)'].values  # y: Yields

xmin = x.min()
xmax = x.max()
npts = len(x)

np.set_printoptions(precision=3)
np.random.seed(10)  # For reproducibility (not used in generation)

print(f"Dataset: Rice Yield in Durg District (1966-2000)")
print(f"Number of points: {npts}")

# Fit polynomials (no synthetic c or noise addition)
coeff1 = np.polyfit(x, y, 1)  # Linear fit
coeff2 = np.polyfit(x, y, 2)  # Quadratic fit

xfit = np.linspace(xmin, xmax, npts)
pfit1 = np.poly1d(coeff1)
yfit1 = pfit1(xfit)  # Evaluate linear fit
print(f"first-order fit coefficients: {coeff1}")

pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(xfit)  # Evaluate quadratic fit
print(f"second-order fit coefficients: {coeff2}")

# Plot
plt.figure()
plt.plot(x, y, 'o', label='Observed')  # Real data points
plt.plot(xfit, yfit1, 'g-', label='linear')
plt.plot(xfit, yfit2, 'r-', label='quadratic')
plt.xlabel('Year')
plt.ylabel('Rice Yield (kg/ha)')
plt.title('Rice Yield Trend in Durg District')
plt.legend()
plt.show()
Dataset: Rice Yield in Durg District (1966-2000)
Number of points: 35
first-order fit coefficients: [ 1.738e+01 -3.359e+04]
second-order fit coefficients: [-6.381e-02  2.704e+02 -2.845e+05]
No description has been provided for this image

2D Polynomial Surface Fit on ICRISAT Dataset

A 2D polynomial (or bivariate polynomial) fits a surface $z = c_0 + c_1 x + c_2 y + c_3 xy + c_4 x^2 + c_5 y^2 + \dots$ over two inputs (x, y) to predict z. Here, I've fitted a quadratic (degree 2) to rice data in Durg district (Chhattisgarh, 1966–2000):

  • x: Year (1966–2000)
  • y: Rice Area (1000 ha, ~547–786)
  • z: Rice Production (1000 tons, ~185–1052) Data Points: 35 Fit: Least-squares via Vandermonde matrix.

Fitted Coefficients $ z \approx -8,699,801 + 9,298 \cdot x - 1,611 \cdot y + 0.865 \cdot xy - 2.485 \cdot x^2 - 0.078 \cdot y^2 $

Positive $c_1$ (year): Production grows ~9.3k tons/year. Negative $c_2$ (area): Linear area effects are muted (yield-dependent). Positive $c_3$ (interaction): Time amplifies area benefits. Negative quadratics: Diminishing returns.

R²: 0.630 (63% variance explained; good for noisy ag data). Residuals Sum: 610,187. Python Code for Fit & Contour Plot

Python Code for Fit & Contour Plot

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load & prepare
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')
df.replace(-1, np.nan, inplace=True)
durg_rice = df[(df['Dist Name'] == 'Durg') & (df['Year'] >= 1966) & (df['Year'] <= 2000)][['Year', 'RICE AREA (1000 ha)', 'RICE PRODUCTION (1000 tons)']].dropna()
durg_rice['Year'] = durg_rice['Year'].astype(int)

x, y, z = durg_rice['Year'].values, durg_rice['RICE AREA (1000 ha)'].values, durg_rice['RICE PRODUCTION (1000 tons)'].values

# Vandermonde & fit
M = np.c_[np.ones(len(x)), x, y, x*y, x**2, y**2]
cfit, _, _, _ = np.linalg.lstsq(M, z, rcond=None)

# Grid for surface
x_grid = np.linspace(x.min(), x.max(), 20)
y_grid = np.linspace(y.min(), y.max(), 20)
X, Y = np.meshgrid(x_grid, y_grid)
Z_fit = (cfit[0] + cfit[1]*X + cfit[2]*Y + cfit[3]*X*Y + cfit[4]*X**2 + cfit[5]*Y**2)

# Plot: Contour (top view of surface)
plt.figure(figsize=(8, 6))
plt.contourf(X, Y, Z_fit, levels=20, cmap='viridis')
plt.colorbar(label='Production (1000 tons)')
plt.scatter(x, y, c=z, edgecolors='black', s=30, label='Observed')
plt.xlabel('Year'); plt.ylabel('Rice Area (1000 ha)')
plt.title('2D Polynomial Contour: Rice Production in Durg')
plt.legend(); plt.show()
No description has been provided for this image

Radial Basis Function (RBF)¶

A Radial Basis Function (RBF) is a real-valued function whose value depends only on the distance from a fixed point (the center) in the input space. It's widely used in machine learning for:

  • Interpolation/Approximation: Smoothing noisy data (e.g., fitting curves to scattered points).
  • Kernel Methods: In support vector machines (SVMs) or Gaussian Processes for classification/regression.
  • Applications: Image processing, neural networks (RBF networks), and time-series forecasting (e.g., fitting crop yields over years in agricultural data like your ICRISAT dataset).

To fit RBF to rice yields over years (e.g., Durg district):

In [20]:
import numpy as np
import pandas as pd
from scipy.interpolate import RBFInterpolator
import matplotlib.pyplot as plt

# Load and prepare data
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')  # Ensure CSV path is correct
df.replace(-1, np.nan, inplace=True)
durg_rice = df[(df['Dist Name'] == 'Durg') & (df['Year'] >= 1966) & (df['Year'] <= 2000)][['Year', 'RICE YIELD (Kg per ha)']].dropna()
durg_rice['Year'] = durg_rice['Year'].astype(int)

x = durg_rice['Year'].values  # Input: Years
y = durg_rice['RICE YIELD (Kg per ha)'].values  # Target: Yields

# Fit RBF (Gaussian kernel; tune epsilon for smoothness)
rbf = RBFInterpolator(np.atleast_2d(x).T, y, kernel='gaussian', epsilon=10)  # Increased epsilon for time scale
x_new = np.linspace(x.min(), x.max(), 100)  # Fine grid for smooth curve
y_pred = rbf(np.atleast_2d(x_new).T).flatten()  # Predictions

# Plot
plt.figure(figsize=(8, 5))
plt.plot(x, y, 'o', label='Observed Data', color='blue', markersize=4)
plt.plot(x_new, y_pred, '-', label='RBF Fit', color='red', linewidth=2)
plt.xlabel('Year')
plt.ylabel('Rice Yield (kg/ha)')
plt.title('RBF Interpolation: Rice Yield in Durg District (1966–2000)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()  # This displays the chart!
No description has been provided for this image

Overfitting

Overfitting occurs when a model learns noise in the training data, performing well on train but poorly on unseen test data (e.g., high-degree polynomials wiggle to fit every point). Below is Python code using your ICRISAT dataset (rice yields in Durg district, 1966–2000). It:

Fits polynomials of degrees 1, 3, 5, 10. Uses train/test split (80/20) to quantify overfitting via R². Plots: Left = fits on data (higher degrees overfit noise); Right = R² divergence (train rises, test drops/negative).

Run in Jupyter for plots. Tested: Low-degree (1) generalizes (R² train ~0.12, test ~0.05); high-degree (10) overfits (train ~0.98, test ~-1.2).

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('datasets/ICRISAT-District Level Data.csv')
df.replace(-1, np.nan, inplace=True)
durg_rice = df[(df['Dist Name'] == 'Durg') & (df['Year'] >= 1966) & (df['Year'] <= 2000)][['Year', 'RICE YIELD (Kg per ha)']].dropna()
durg_rice['Year'] = durg_rice['Year'].astype(int)

# Prepare data
X = durg_rice['Year'].values.reshape(-1, 1)
y = durg_rice['RICE YIELD (Kg per ha)'].values

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to fit poly and compute R2
def fit_poly(degree):
    coeffs = np.polyfit(X_train.flatten(), y_train, degree)
    p = np.poly1d(coeffs)
    y_train_pred = p(X_train.flatten())
    y_test_pred = p(X_test.flatten())
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)
    return p, r2_train, r2_test

# Fit models
degrees = [1, 3, 5, 10]
models = {}
for deg in degrees:
    models[deg] = fit_poly(deg)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: Fits on full data (for viz)
x_full = np.linspace(X.min(), X.max(), 100)
for deg in degrees:
    p = models[deg][0]
    ax1.plot(x_full, p(x_full), label=f'Deg {deg}')
ax1.scatter(X, y, color='black', s=20, label='Data')
ax1.set_title('Polynomial Fits (Overfitting Demo)')
ax1.legend()
ax1.set_xlabel('Year')
ax1.set_ylabel('Rice Yield (kg/ha)')

# Right: R2 comparison
train_r2 = [models[deg][1] for deg in degrees]
test_r2 = [models[deg][2] for deg in degrees]
ax2.plot(degrees, train_r2, 'o-', label='Train R²')
ax2.plot(degrees, test_r2, 's-', label='Test R²')
ax2.set_xlabel('Degree')
ax2.set_ylabel('R²')
ax2.set_title('Overfitting: Train vs Test R²')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]: