Fitting¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import least_squares
df = pd.read_csv("datasets/Housing.csv")
df.head()
| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
Polynomial¶
# Use 'area' as predictor and 'price' as target
x = df['area'].values
y = df['price'].values
x_smooth = np.linspace(x.min(), x.max(), 500)
# Fit first-order (linear) polynomial
coeff1 = np.polyfit(x, y, 1)
pfit1 = np.poly1d(coeff1)
yfit1 = pfit1(x_smooth) # evaluate first-order fit
print(f"first-order fit coefficients: {coeff1}")
first-order fit coefficients: [4.61974894e+02 2.38730848e+06]
# Fit second-order (quadratic) polynomial
coeff2 = np.polyfit(x, y, 2)
pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(x_smooth) # evaluate second-order fit
print(f"second-order fit coefficients: {coeff2}")
second-order fit coefficients: [-4.35645185e-02 1.03518489e+03 7.95440758e+05]
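As a quick sanity check (a minimal sketch reusing pfit1 and pfit2 from above), we can evaluate both fitted polynomials at a sample area to see what price each model predicts; the 5000 sq ft value below is purely illustrative, not taken from the dataset.
# Sanity check: predict the price of a hypothetical 5000 sq ft house with both fits
sample_area = 5000  # illustrative value, not a row from the dataset
print(f"linear prediction at {sample_area}: {pfit1(sample_area):,.0f}")
print(f"quadratic prediction at {sample_area}: {pfit2(sample_area):,.0f}")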
plt.figure(figsize=(10,6))
plt.scatter(x, y, color='blue', alpha=0.6, label='Data')
plt.plot(x_smooth, yfit1, 'g-', linewidth=2, label='Linear fit')
plt.plot(x_smooth, yfit2, 'r-', linewidth=2, label='Quadratic fit')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Polynomial Fit of House Prices vs Area')
plt.legend()
plt.show()
Interpretation:
The polynomial fitting section compares how well linear and quadratic models can describe the relationship between house area and price. The scatter plot of the raw data shows that larger houses generally have higher prices. The linear model, represented by a straight green line, captures this upward trend but oversimplifies the relationship.
In contrast, the quadratic model, shown in red, bends slightly and fits the data more naturally. This indicates that the relationship between area and price is not perfectly straight but follows a mildly curved pattern. Overall, the quadratic fit provides a more realistic representation of how housing prices increase with area.
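To make "fits the data more naturally" concrete, the small sketch below compares the in-sample root-mean-square error of the two fits (reusing x, y, pfit1, and pfit2 already defined). The quadratic error should come out somewhat lower, though a lower training error alone does not prove the model generalizes better.
# Compare in-sample RMSE of the linear and quadratic fits
rmse1 = np.sqrt(np.mean((pfit1(x) - y) ** 2))
rmse2 = np.sqrt(np.mean((pfit2(x) - y) ** 2))
print(f"linear RMSE:    {rmse1:,.0f}")
print(f"quadratic RMSE: {rmse2:,.0f}")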
Radial basis function (RBF)¶
x = df['area'].values
y = df['price'].values
npts = len(x)
ncenters = 15
np.random.seed(0)
# Pick 15 distinct data points as RBF centers (sampling without replacement
# avoids duplicate centers, which would make the design matrix rank-deficient)
indices = np.random.choice(npts, size=ncenters, replace=False)
centers = x[indices]
# Design matrix: cubic radial basis |x - c|^3 evaluated at every data point and center
M = np.abs(np.outer(x, np.ones(ncenters)) - np.outer(np.ones(npts), centers))**3
# Solve for the basis weights in the least-squares sense
b, residuals, rank, singular_values = np.linalg.lstsq(M, y, rcond=None)
# Evaluate the fitted RBF model on a grid spanning the data range
xfit = np.linspace(x.min(), x.max(), npts)
yfit = (np.abs(np.outer(xfit, np.ones(ncenters)) - np.outer(np.ones(npts), centers))**3) @ b
plt.figure(figsize=(10,6))
plt.plot(x, y, 'o', label='Data')
plt.plot(xfit, yfit, 'g-', label='RBF fit')
# Plot each weighted basis component b[i] * |x - c_i|^3; these sum to the final fit
for i in range(ncenters):
    plt.plot(xfit, b[i] * np.abs(xfit - centers[i])**3, color=(0.75, 0.75, 0.75))
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('RBF Fit of House Prices')
plt.legend()
plt.show()
Interpretation:
The radial basis function model offers a much more flexible approach to fitting the data. By using 15 randomly chosen centers and constructing cubic radial basis functions around them, the model produces a smooth curve that adapts to the local structure of the dataset. The green curve closely follows the ups and downs of the data points, offering a more detailed fit than the simple polynomial models.
The many grey curves in the background represent the individual radial basis components that combine to create the final fit. This method captures subtle variations and provides a more responsive shape, making RBF a powerful tool for modeling nonlinear relationships.
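As a small check (a sketch reusing b, centers, xfit, and yfit from the cell above), we can confirm that the weighted cubic components really do sum to the plotted RBF curve.
# Verify that the weighted basis functions add up to the fitted curve
components = np.array([b[i] * np.abs(xfit - centers[i])**3 for i in range(ncenters)])
print(np.allclose(components.sum(axis=0), yfit))  # expected: True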
Nonlinear least squares¶
x = df['area'].values
y = df['price'].values
# scale x to 0-1 to make tanh fit visible
x_scaled = (x - x.min()) / (x.max() - x.min())
coeff = np.array([y.max(), 0.5, 5, 0.5])  # initial guess; amplitude scaled to match the price magnitude
def f(coeff, x):
    # tanh model: amplitude * (offset + tanh(steepness * (x - center)))
    return coeff[0] * (coeff[1] + np.tanh(coeff[2] * (x - coeff[3])))
def residuals(coeff, x, y):
    # difference between model predictions and observed prices
    return f(coeff, x) - y
# Stop the optimizer early to inspect intermediate stages of the fit
result2 = least_squares(residuals, coeff, args=(x_scaled, y), max_nfev=2)
result10 = least_squares(residuals, coeff, args=(x_scaled, y), max_nfev=10)
resultend = least_squares(residuals, coeff, args=(x_scaled, y))  # run to convergence
x_sorted = np.sort(x_scaled)
plt.figure(figsize=(10,6))
plt.scatter(x_scaled, y, color='blue', alpha=0.6, label='data')
plt.plot(x_sorted, f(coeff, x_sorted), 'b-', label='start')
plt.plot(x_sorted, f(result2.x, x_sorted), 'c-', label='2 evaluations')
plt.plot(x_sorted, f(result10.x, x_sorted), 'g-', label='10 evaluations')
plt.plot(x_sorted, f(resultend.x, x_sorted), 'r-', label='end')
plt.xlabel('Area (scaled)')
plt.ylabel('Price')
plt.title('Nonlinear Least Squares Fit (tanh) for Housing Data')
plt.legend()
plt.show()
Interpretation:
The nonlinear least squares section uses a tanh-based model to fit the scaled area data. The graph shows different stages of the optimization process: the initial guess (blue), the fit after 2 function evaluations (cyan), the fit after 10 evaluations (green), and the final optimized fit (red).
At first, the function does not match the data well, but as the algorithm performs more iterations, the curve progressively improves. The final fit aligns closely with the data’s overall shape, demonstrating how nonlinear optimization gradually adjusts parameters to reduce error. This experiment shows how nonlinear models require careful tuning but can produce smooth and meaningful representations once optimized.
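A short sketch (reusing result2, result10, and resultend from above) makes "progressively improves" measurable: least_squares reports half the sum of squared residuals in its cost attribute, which should shrink as more function evaluations are allowed.
# Compare the residual cost (0.5 * sum of squared residuals) at each stage
for name, res in [("2 evaluations", result2), ("10 evaluations", result10), ("converged", resultend)]:
    print(f"{name}: cost = {res.cost:.3e}, function evaluations = {res.nfev}")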
Overfitting¶
x = df['area'].values
y = df['price'].values
order = 15
# Low-order fit for reference and a deliberately over-complex high-order fit
coeff2 = np.polyfit(x, y, 2)
coeffN = np.polyfit(x, y, order)  # ill-conditioned at this degree; NumPy may warn
xfit = np.linspace(x.min(), x.max(), len(x))
pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(xfit)
pfitN = np.poly1d(coeffN)
yfitN = pfitN(xfit)
plt.figure(figsize=(10,6))
plt.scatter(x, y, color='blue', alpha=0.6, label='Data')
plt.plot(xfit, yfit2, 'g-', label='order 2')
plt.plot(xfit, yfitN, 'r-', label=f'order {order}')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Polynomial Fit of House Prices')
plt.legend()
plt.show()
Interpretation:
The overfitting experiment compares a simple second-order polynomial with a highly complex fifteenth-order polynomial. The second-order fit (green) captures the general increasing trend between area and price, creating a smooth curve that reflects the underlying relationship.
In contrast, the fifteenth-order fit (red) wiggles dramatically, trying to pass through or near every point. Although it fits the training data extremely closely, the curve behaves unrealistically and fails to generalize the true pattern.
This demonstrates the problem of overfitting: using a model that is too complex causes it to learn noise rather than meaningful structure. The comparison highlights why simpler models often perform better in real-world prediction tasks.
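One way to see the overfitting directly (a minimal sketch, using a simple random 80/20 split rather than any principled validation scheme) is to hold out part of the data and compare errors: the high-order polynomial tends to do worse on the held-out points than the quadratic, even when its training error is lower; the exact numbers depend on the split.
# Hold out 20% of the data and compare train vs test RMSE for both polynomial degrees
rng = np.random.default_rng(0)
perm = rng.permutation(len(x))
split = int(0.8 * len(x))
train, test = perm[:split], perm[split:]
for deg in (2, order):
    p = np.poly1d(np.polyfit(x[train], y[train], deg))
    rmse_train = np.sqrt(np.mean((p(x[train]) - y[train]) ** 2))
    rmse_test = np.sqrt(np.mean((p(x[test]) - y[test]) ** 2))
    print(f"degree {deg:2d}: train RMSE = {rmse_train:,.0f}, test RMSE = {rmse_test:,.0f}")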