Karma Tshomo - Fab Futures - Data Science

Fitting

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import least_squares
In [12]:
df = pd.read_csv("datasets/Housing.csv")
df.head()
Out[12]:
price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea furnishingstatus
0 13300000 7420 4 2 3 yes no no no yes 2 yes furnished
1 12250000 8960 4 4 4 yes no no no yes 3 no furnished
2 12250000 9960 3 2 2 yes no yes no no 2 yes semi-furnished
3 12215000 7500 4 2 2 yes no yes no yes 3 yes furnished
4 11410000 7420 4 1 2 yes yes yes no yes 2 no furnished

Polynomial

In [13]:
# Use 'area' as predictor and 'price' as target
x = df['area'].values
y = df['price'].values
In [14]:
x_smooth = np.linspace(x.min(), x.max(), 500) 
In [15]:
# Fit first-order (linear) polynomial
coeff1 = np.polyfit(x, y, 1)
pfit1 = np.poly1d(coeff1)
yfit1 = pfit1(x_smooth)  # evaluate first-order fit
print(f"first-order fit coefficients: {coeff1}")
first-order fit coefficients: [4.61974894e+02 2.38730848e+06]
In [16]:
coeff2 = np.polyfit(x, y, 2)
pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(x_smooth)  # evaluate second-order fit
print(f"second-order fit coefficients: {coeff2}")
second-order fit coefficients: [-4.35645185e-02  1.03518489e+03  7.95440758e+05]
In [17]:
plt.figure(figsize=(10,6))
plt.scatter(x, y, color='blue', alpha=0.6, label='Data')
plt.plot(x_smooth, yfit1, 'g-', linewidth=2, label='Linear fit')
plt.plot(x_smooth, yfit2, 'r-', linewidth=2, label='Quadratic fit')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Polynomial Fit of House Prices vs Area')
plt.legend()
plt.show()
[Figure: scatter of price vs. area with the linear fit (green) and quadratic fit (red)]

Interpretation:

The polynomial fitting section compares how well linear and quadratic models can describe the relationship between house area and price. The scatter plot of the raw data shows that larger houses generally have higher prices. The linear model, represented by a straight green line, captures this upward trend but oversimplifies the relationship.

In contrast, the quadratic model, shown in red, bends slightly and fits the data more naturally. This indicates that the relationship between area and price is not perfectly straight but follows a mildly curved pattern. Overall, the quadratic fit provides a more realistic representation of how housing prices increase with area.
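
To make the comparison concrete, one could compute the training root-mean-square error of each fit (a quick check sketched here, not part of the original notebook; rmse1 and rmse2 are names introduced for illustration):

# training RMSE of each polynomial, evaluated at the data points
rmse1 = np.sqrt(np.mean((pfit1(x) - y)**2))
rmse2 = np.sqrt(np.mean((pfit2(x) - y)**2))
print(f"linear fit RMSE:    {rmse1:,.0f}")
print(f"quadratic fit RMSE: {rmse2:,.0f}")

A lower RMSE for the quadratic fit would support the visual impression, though on training data a higher-order polynomial always fits at least as well.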

Radial basis function (RBF)

In [18]:
x = df['area'].values
y = df['price'].values

npts = len(x)
ncenters = 15
np.random.seed(0)

# pick random data points to serve as RBF centers
indices = np.random.choice(npts, size=ncenters, replace=False)
centers = x[indices]

# design matrix of cubic radial basis functions |x - c|^3, one column per center
M = np.abs(np.outer(x, np.ones(ncenters)) - np.outer(np.ones(npts), centers))**3
# solve the linear least-squares problem M @ b ~ y for the weights b
b, lsq_residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)

xfit = np.linspace(x.min(), x.max(), npts)
# evaluate the fitted RBF expansion on an evenly spaced grid
yfit = (np.abs(np.outer(xfit, np.ones(ncenters)) - np.outer(np.ones(npts), centers))**3) @ b

plt.figure(figsize=(10,6))
plt.plot(x, y, 'o', label='Data')
plt.plot(xfit, yfit, 'g-', label='RBF fit')
for i in range(ncenters):
    # plot each weighted basis component b[i]*|x - c_i|^3; these sum to the green fit
    plt.plot(xfit, b[i] * np.abs(xfit - centers[i])**3, color=(0.75, 0.75, 0.75))
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('RBF Fit of House Prices')
plt.legend()
plt.show()
[Figure: housing data (points) with the RBF fit (green) and the weighted basis components (grey)]

Interpretation:

The radial basis function model offers a much more flexible approach to fitting the data. By using 15 randomly chosen centers and constructing cubic radial basis functions around them, the model produces a smooth curve that adapts to the local structure of the dataset. The green curve closely follows the ups and downs of the data points, offering a more detailed fit than the simple polynomial models.

The many grey curves in the background represent the individual radial basis components that combine to create the final fit. This method captures subtle variations and provides a more responsive shape, making RBF a powerful tool for modeling nonlinear relationships.
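
The cubic kernel is only one choice. As a sketch (an assumed variant, not in the original notebook; width, G, bg, and yfit_gauss are names introduced here), the same least-squares machinery works with Gaussian basis functions, where the width parameter sets how local each bump is:

# Gaussian RBF variant; 'width' is an assumed length scale, tune to taste
width = (x.max() - x.min()) / ncenters
G = np.exp(-((x[:, None] - centers[None, :]) / width)**2)  # design matrix
bg, *_ = np.linalg.lstsq(G, y, rcond=None)
yfit_gauss = np.exp(-((xfit[:, None] - centers[None, :]) / width)**2) @ bg

Narrower widths make the fit more wiggly; wider ones smooth it toward a global trend.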

Nonlinear least squares

In [19]:
x = df['area'].values
y = df['price'].values

# scale x to 0-1 to make tanh fit visible
x_scaled = (x - x.min()) / (x.max() - x.min())

coeff = np.array([y.max(), 0.5, 5, 0.5])  # scale coefficients to match data magnitude

def f(coeff, x):
    return coeff[0] * (coeff[1] + np.tanh(coeff[2] * (x - coeff[3])))

def residuals(coeff, x, y):
    return f(coeff, x) - y

result2 = least_squares(residuals, coeff, args=(x_scaled, y), max_nfev=2)
result10 = least_squares(residuals, coeff, args=(x_scaled, y), max_nfev=10)
resultend = least_squares(residuals, coeff, args=(x_scaled, y))

x_sorted = np.sort(x_scaled)

plt.figure(figsize=(10,6))
plt.scatter(x_scaled, y, color='blue', alpha=0.6, label='data')
plt.plot(x_sorted, f(coeff, x_sorted), 'b-', label='start')
plt.plot(x_sorted, f(result2.x, x_sorted), 'c-', label='2 evaluations')
plt.plot(x_sorted, f(result10.x, x_sorted), 'g-', label='10 evaluations')
plt.plot(x_sorted, f(resultend.x, x_sorted), 'r-', label='end')
plt.xlabel('Area (scaled)')
plt.ylabel('Price')
plt.title('Nonlinear Least Squares Fit (tanh) for Housing Data')
plt.legend()
plt.show()
[Figure: tanh fits at successive stages: start (blue), 2 evaluations (cyan), 10 evaluations (green), final (red)]

Interpretation:

The nonlinear least squares section uses a tanh-based model to fit the scaled area data. The graph shows different stages of the optimization process: the initial guess (blue line), the fit after 2 function evaluations (cyan), the fit after 10 evaluations (green), and the final optimized fit (red).

At first, the function does not match the data well, but as the algorithm performs more iterations, the curve progressively improves. The final fit aligns closely with the data’s overall shape, demonstrating how nonlinear optimization gradually adjusts parameters to reduce error. This experiment shows how nonlinear models require careful tuning but can produce smooth and meaningful representations once optimized.
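
One way to see this numerically (a diagnostic added here, not in the original notebook) is to print the cost that least_squares minimizes, 0.5 times the sum of squared residuals, at each stage:

# cost and parameters at each optimization stage
for label, res in [('2 evals', result2), ('10 evals', result10), ('final', resultend)]:
    print(f"{label}: cost = {res.cost:.3e}, params = {np.round(res.x, 3)}")

The cost should shrink as the optimizer is allowed more function evaluations.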

Overfitting

In [20]:
x = df['area'].values
y = df['price'].values

order = 15  # deliberately high order to provoke overfitting

coeff2 = np.polyfit(x, y, 2)
coeffN = np.polyfit(x, y, order)  # ill-conditioned at this order; NumPy may emit a RankWarning

xfit = np.linspace(x.min(), x.max(), len(x))

pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(xfit)

pfitN = np.poly1d(coeffN)
yfitN = pfitN(xfit)

plt.figure(figsize=(10,6))
plt.scatter(x, y, color='blue', alpha=0.6, label='Data')
plt.plot(xfit, yfit2, 'g-', label='order 2')
plt.plot(xfit, yfitN, 'r-', label=f'order {order}')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Polynomial Fit of House Prices')
plt.legend()
plt.show()
[Figure: housing data with the order-2 fit (green) and the order-15 fit (red)]

Interpretation:

The overfitting experiment compares a simple second-order polynomial with a highly complex fifteenth-order polynomial. The second-order fit (green) captures the general increasing trend between area and price, creating a smooth curve that reflects the underlying relationship.

In contrast, the fifteenth-order fit (red) wiggles dramatically, trying to pass through or near every point. Although it fits the training data extremely closely, the curve behaves unrealistically and fails to generalize the true pattern.

This demonstrates the problem of overfitting: using a model that is too complex causes it to learn noise rather than meaningful structure. The comparison highlights why simpler models often perform better in real-world prediction tasks.
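
A minimal hold-out experiment (an assumed extension, not in the original notebook; the 80/20 split and seed are arbitrary choices) makes the point quantitatively: fit both orders on a random 80% of the data and measure the error on the remaining 20%:

# out-of-sample comparison of the two polynomial orders
rng = np.random.default_rng(0)
idx = rng.permutation(len(x))
split = int(0.8 * len(x))
train, test = idx[:split], idx[split:]
for d in (2, order):
    p = np.poly1d(np.polyfit(x[train], y[train], d))  # high orders may trigger a RankWarning
    test_rmse = np.sqrt(np.mean((p(x[test]) - y[test])**2))
    print(f"order {d}: test RMSE = {test_rmse:,.0f}")

If the order-15 polynomial is overfitting, its test RMSE will typically be worse than the order-2 fit despite its lower training error.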