Principal Components Analysis (PCA)¶
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition
np.set_printoptions(precision=1)
import pandas as pd
df = pd.read_csv("datasets/Housing.csv")
df.head()
# use numeric columns only (like pixels)
X = df.select_dtypes(include=[np.number]).values
y = df["price"].values # color by house price
print(f"Housing data shape (records,features): {X.shape}")
Housing data shape (records,features): (545, 6)
# plot vs two feature values
plt.scatter(X[:,1], X[:,0], c=y) # area vs price (similar to pixels 200 & 400)
plt.xlabel("area")
plt.ylabel("price")
plt.title("Housing Data vs area and price")
plt.colorbar(label="price")
plt.show()
Interpretation:
The scatter plot shows that houses with larger area generally have higher prices. As the area increases, the price tends to rise, although there is some variation. This means area is an important factor that influences the price of a house.
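The area–price trend can also be quantified with a Pearson correlation coefficient. A minimal sketch, using synthetic data as a stand-in for the housing columns (the real notebook would use `X[:, 1]` and `X[:, 0]`):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-ins for the area and price columns (assumption:
# real data would come from the Housing.csv columns instead)
area = rng.uniform(2000, 16000, size=545)
price = 500 * area + rng.normal(0, 1e6, size=545)

# correlation matrix; off-diagonal entry is the Pearson r
r = np.corrcoef(area, price)[0, 1]
print(f"Pearson correlation between area and price: {r:.2f}")
```

A value near 1 would confirm the strong positive relationship the scatter plot suggests.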
# standardize (zero mean, unit variance)
print(f"data mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")
X = X - np.mean(X, axis=0)
std = np.std(X, axis=0)
Xscale = X / np.where(std > 0, std, 1)
print(f"standardized data mean: {np.mean(Xscale):.2f}, variance: {np.var(Xscale):.2f}")
data mean: 0.00, variance: 582021618208.62
standardized data mean: -0.00, variance: 1.00
Interpretation:
Standardization shifts all features to have a mean of zero and a variance close to one. This ensures that all variables are on the same scale, preventing larger-valued features from dominating the analysis.
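The manual z-scoring above (with the guard against zero-variance columns) should match scikit-learn's `StandardScaler`, which by default also centers to zero mean and scales to unit variance. A small check on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# three synthetic features on very different scales
X = rng.normal(loc=[10.0, -3.0, 500.0], scale=[2.0, 0.5, 100.0], size=(100, 3))

# manual z-scoring, guarding against zero-variance columns as in the cell above
Xc = X - X.mean(axis=0)
std = X.std(axis=0)
Xscale = Xc / np.where(std > 0, std, 1)

# sklearn's StandardScaler performs the same transformation
Xsk = StandardScaler().fit_transform(X)
print(np.allclose(Xscale, Xsk))  # → True
```

Both use the population standard deviation (ddof=0), so the results agree exactly.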
# do 5-component PCA (same as MNIST 50 PCA, but dataset smaller)
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
Xpca = pca.transform(Xscale)
plt.plot(pca.explained_variance_, 'o')
plt.xlabel('PCA component')
plt.ylabel('explained variance')
plt.title('Housing PCA')
plt.show()
Interpretation:
The PCA explained-variance plot shows how much information each principal component captures from the housing dataset. The first component explains the most variation, and each following component explains less. This helps identify which components are most important for reducing dimensionality while keeping the key patterns in the data.
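A common way to decide how many components to keep is the cumulative explained-variance ratio. A sketch on synthetic standardized data standing in for `Xscale` (the correlated columns are an assumption, added so the PCA has some structure):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic stand-in for the standardized housing matrix
X = rng.normal(size=(545, 6))
X[:, 0] += 2 * X[:, 1]                      # correlate two columns
X = (X - X.mean(axis=0)) / X.std(axis=0)    # re-standardize

pca = PCA(n_components=5).fit(X)
# cum[k-1] = fraction of total variance captured by the first k components
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum)
```

Reading off the smallest k where the cumulative fraction crosses a threshold (say 0.9) gives a principled cutoff for dimensionality reduction.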
# plot vs first two PCA components
plt.scatter(Xpca[:,0], Xpca[:,1], c=y, s=20)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.title("Housing Data vs two principal components")
plt.colorbar(label="price")
plt.show()
Interpretation:
The scatter plot of the first two PCA components shows how the houses are spread in the new reduced space. Houses with similar prices appear closer together, meaning the PCA components capture useful patterns related to price.
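To see *which* original features drive each principal component, one can inspect `pca.components_`: each row is a unit-length direction in feature space, and a large absolute weight means that feature contributes strongly to that component. A sketch on synthetic data (the housing features are not assumed here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2).fit(X)
# rows of components_ are unit-length directions in feature space;
# a large |weight| means that feature drives that component
for i, comp in enumerate(pca.components_):
    print(f"PC{i+1} loadings:", np.round(comp, 2))
print("row norms:", np.linalg.norm(pca.components_, axis=1))  # each ~1
```

On the housing data, pairing these loadings with the column names would show which measured attributes (area, bedrooms, etc.) each component summarizes.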