[Rinchen Khandu] - Fab Futures - Data Science

Functions and Fittings¶

Goal¶

  • adjust the parameters of a function so that it matches data

Functions¶

linear¶

  • $y = ax$

Fitting¶

  • residual: difference $\epsilon_i$ between data value $y_i$ and fit $f(x_i)$
    • $\epsilon_i = y_i-f(x_i)$
  • loss: a single number measuring the total size of the residuals, which the fit minimizes
    • least squares ($L^2$ norm): sum of squares
      • $\sum_i \epsilon_i^2$
      • most common
    • $L^1$ norm: sum of absolute values
      • $\sum_i |\epsilon_i|$
      • less sensitive to outliers
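The two losses above can be compared on a small example. This is a minimal sketch on made-up toy data (not the course dataset): the helper names `f`, `residuals`, `l2_loss`, and `l1_loss` are hypothetical, and the brute-force scan over slopes is only for illustration, not how a real fit would be computed.

```python
import numpy as np

# Hypothetical toy data: points on y = 2x, except the last one, an outlier
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 2.0, 4.0, 6.0, 18.0])

def f(x, a):
    """Candidate fit: a line through the origin, y = a*x."""
    return a * x

def residuals(a):
    # epsilon_i = y_i - f(x_i)
    return y - f(x, a)

def l2_loss(a):
    # least squares: sum of squared residuals
    return np.sum(residuals(a) ** 2)

def l1_loss(a):
    # L1 norm: sum of absolute residuals
    return np.sum(np.abs(residuals(a)))

# Scan candidate slopes on a grid and keep the minimizer of each loss
slopes = np.linspace(0.0, 5.0, 501)
a_l2 = slopes[np.argmin([l2_loss(a) for a in slopes])]
a_l1 = slopes[np.argmin([l1_loss(a) for a in slopes])]

print(f"L2 slope: {a_l2:.2f}, L1 slope: {a_l1:.2f}")
```

On this data the $L^1$ slope stays at the slope of the four clean points, while the least-squares slope is pulled toward the outlier.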

linear least squares¶

  • used for models where the coefficients appear linearly
  • algorithm: Singular Value Decomposition (SVD)
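As a rough illustration of the SVD approach, the sketch below fits a quadratic to synthetic, noise-free data. The data and the names `A`, `c_svd`, and `c_lstsq` are assumptions made for the example. The model is linear in its coefficients even though it is not linear in $x$, which is what makes this method applicable.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix: one column per basis function (1, x, x^2)
A = np.column_stack([np.ones_like(x), x, x**2])

# Minimize ||A c - y||^2 with the SVD: A = U S V^T, so c = V S^-1 U^T y
U, s, Vt = np.linalg.svd(A, full_matrices=False)
c_svd = Vt.T @ ((U.T @ y) / s)

# np.linalg.lstsq solves the same least-squares problem
c_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("SVD solution:  ", c_svd)
print("lstsq solution:", c_lstsq)
```

Both solutions should recover the coefficients used to generate the data, up to floating-point error.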

polynomial¶

1D¶
  • routine: polyfit
    • find the least-squares fit for a 1D polynomial
  • function: $y=c_0+c_1x+c_2x^2$
    • np.polyfit returns the coefficients with the highest power first: $[c_2, c_1, c_0]$
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ----------------------------------
# LOAD YOUR DATA
# ----------------------------------
data = pd.read_csv("~/work/rinchen-khandu/datasets/student_depression_dataset.csv")

# Select columns
x_col = "Age"
y_col = "Depression"

# Drop missing values
df = data[[x_col, y_col]].dropna()

x = df[x_col].values
y = df[y_col].values

np.set_printoptions(precision=3)

# ----------------------------------
# POLYNOMIAL FITTING
# ----------------------------------

# First-order (linear) fit
coeff1 = np.polyfit(x, y, 1)
pfit1 = np.poly1d(coeff1)

# Second-order (quadratic) fit
coeff2 = np.polyfit(x, y, 2)
pfit2 = np.poly1d(coeff2)

print(f"First-order fit coefficients (linear): {coeff1}")
print(f"Second-order fit coefficients (quadratic): {coeff2}")

# ----------------------------------
# CREATE SMOOTH FIT CURVES
# ----------------------------------
xmin, xmax = x.min(), x.max()
xfit = np.linspace(xmin, xmax, 200)

yfit1 = pfit1(xfit)
yfit2 = pfit2(xfit)

# ----------------------------------
# PLOTTING
# ----------------------------------
plt.figure(figsize=(8, 6))
plt.plot(x, y, 'o', alpha=0.6, label="Observed data")
plt.plot(xfit, yfit1, 'g-', linewidth=2, label="Linear fit")
plt.plot(xfit, yfit2, 'r-', linewidth=2, label="Quadratic fit")

plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title("Linear vs Quadratic Fit using Real Data")
plt.legend()
plt.grid(True)
plt.show()
First-order fit coefficients (linear): [-0.023  1.173]
Second-order fit coefficients (quadratic): [-0.001  0.019  0.651]
[Figure: "Linear vs Quadratic Fit using Real Data": observed data with the linear and quadratic fits, Depression vs. Age]

the problems with polynomials¶

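  • can't fit sharp features: polynomials are smooth everywhere, so steps and narrow peaks get rounded off or cause ringing
  • divergences: high-order fits oscillate between data points and grow without bound outside the data range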
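Both problems show up in a classic textbook case, Runge's function, sketched here on synthetic data rather than the course dataset. The function `runge` and the variable names are chosen for this example only.

```python
import numpy as np
from numpy.polynomial import Polynomial

def runge(x):
    """A smooth function with a narrow peak at x = 0."""
    return 1.0 / (1.0 + 25.0 * x**2)

# 11 equally spaced samples on [-1, 1]
x = np.linspace(-1.0, 1.0, 11)
y = runge(x)

# A degree-10 polynomial has 11 coefficients, so it passes through every sample
p = Polynomial.fit(x, y, deg=10)

# Compare the error at the samples with the error between them
xfine = np.linspace(-1.0, 1.0, 1001)
err_at_samples = np.max(np.abs(p(x) - y))
err_between = np.max(np.abs(p(xfine) - runge(xfine)))

print(f"max error at the samples:    {err_at_samples:.2e}")
print(f"max error between the samples: {err_between:.2f}")
```

The fit matches every sample almost exactly, yet between samples near the ends of the interval it swings away from the true function by more than the function's own peak height.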
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ----------------------------------
# LOAD YOUR DATA
# ----------------------------------
data = pd.read_csv("~/work/rinchen-khandu/datasets/student_depression_dataset.csv")

# Choose columns
x_col = "Age"          # independent variable
y_col = "Depression"   # dependent variable

# Drop missing values
df = data[[x_col, y_col]].dropna()

x = df[x_col].values
y = df[y_col].values

# ----------------------------------
# SORT DATA BY x (optional here: the fits are drawn on xplot)
# ----------------------------------
idx = np.argsort(x)
x = x[idx]
y = y[idx]

xmin, xmax = x.min(), x.max()
npts = len(x)

# ----------------------------------
# FIT POLYNOMIALS OF DIFFERENT ORDERS
# ----------------------------------
xplot = np.linspace(xmin - 0.2, xmax + 0.2, 300)

# Order 1 (linear)
coeff1 = np.polyfit(x, y, 1)
yfit1 = np.poly1d(coeff1)(xplot)

# Order 4 (moderate complexity)
coeff4 = np.polyfit(x, y, 4)
yfit4 = np.poly1d(coeff4)(xplot)

# Order 15 (high-degree / overfitting)
coeff15 = np.polyfit(x, y, 15)
yfit15 = np.poly1d(coeff15)(xplot)

# ----------------------------------
# PLOTTING
# ----------------------------------
fig = plt.figure(figsize=(8, 6))
fig.canvas.header_visible = False

plt.plot(x, y, 'bo', alpha=0.6, label='Observed data')
plt.plot(xplot, yfit1, 'g-', linewidth=2, label='Order 1 (Underfit)')
plt.plot(xplot, yfit4, 'c-', linewidth=2, label='Order 4')
plt.plot(xplot, yfit15, 'r-', linewidth=2, label='Order 15 (Overfit)')

plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title("Polynomial Fit Comparison Using Real Data")
plt.legend()
plt.grid(True)
plt.show()
/tmp/ipykernel_22878/193982490.py:44: RankWarning: Polyfit may be poorly conditioned
  coeff15 = np.polyfit(x, y, 15)
[Figure: "Polynomial Fit Comparison Using Real Data": observed data with polynomial fits of order 1, 4, and 15, Depression vs. Age]
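The RankWarning above is raised when the least-squares system is numerically close to singular, which can happen when high powers of large x values are used directly. One possible way to reduce it, sketched below on synthetic data, is `numpy.polynomial.Polynomial.fit`, which rescales x to $[-1, 1]$ before solving. The data here is a made-up stand-in for an age-like column, not the course dataset, and better conditioning does not make an order-15 fit any less prone to overfitting.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical stand-in for an age-like column: x values far from zero
x = np.linspace(18.0, 34.0, 200)
y = np.sin(x / 3.0) + 0.1 * rng.standard_normal(x.size)

# Condition number of the raw design matrix (columns x^15 ... x^0)
cond_raw = np.linalg.cond(np.vander(x, 16))

# Same matrix after mapping x linearly onto [-1, 1]
xs = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
cond_scaled = np.linalg.cond(np.vander(xs, 16))

# Polynomial.fit applies this kind of rescaling internally
p = Polynomial.fit(x, y, deg=15)
rms = np.sqrt(np.mean((p(x) - y) ** 2))

print(f"condition number, raw x:    {cond_raw:.2e}")
print(f"condition number, scaled x: {cond_scaled:.2e}")
print(f"RMS residual of the order-15 fit: {rms:.3f}")
```

The returned object is called like a function, `p(xplot)`, and `p.convert().coef` gives the coefficients in the original x variable, lowest power first (the opposite order from np.polyfit).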