[karma Tshomo] - Fab Futures - Data Science

Principal Components Analysis (PCA)¶

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition
np.set_printoptions(precision=1)


import pandas as pd
df = pd.read_csv("datasets/Housing.csv")
df.head()

# use numeric columns only (like pixels)
X = df.select_dtypes(include=[np.number]).values
y = df["price"].values   # color by house price

print(f"Housing data shape (records,features): {X.shape}")
Housing data shape (records,features): (545, 6)
In [10]:
# plot vs two feature values

plt.scatter(X[:,1], X[:,0], c=y)   # area vs price (similar to pixels 200 & 400)
plt.xlabel("area")
plt.ylabel("price")
plt.title("Housing Data vs area and price")
plt.colorbar(label="price")
plt.show()
[Figure: scatter of price vs area, points colored by price]

Interpretation:

The scatter plot shows that houses with larger area generally have higher prices. As the area increases, the price tends to rise, although there is some variation. This means area is an important factor that influences the price of a house.
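The claim that area influences price can be made quantitative with a correlation coefficient. A minimal sketch using `np.corrcoef` on a few hypothetical area/price pairs (the values below are made up to illustrate the upward trend, not taken from Housing.csv):

```python
import numpy as np

# Hypothetical area/price pairs mimicking the trend in the scatter plot.
area  = np.array([3000, 4500, 6000, 7500, 9000], dtype=float)
price = np.array([2.1e6, 3.0e6, 4.2e6, 4.8e6, 6.1e6])

# Pearson correlation: +1 is a perfect increasing linear relationship.
r = np.corrcoef(area, price)[0, 1]
print(f"Pearson r between area and price: {r:.2f}")
```

On the real dataset the same one-liner, `np.corrcoef(df["area"], df["price"])[0, 1]`, would put a number on how strong the relationship actually is.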

In [14]:
# standardize (zero mean, unit variance)

print(f"data mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")

X = X - np.mean(X, axis=0)
std = np.std(X, axis=0)
Xscale = X / np.where(std > 0, std, 1)

print(f"standardized data mean: {np.mean(Xscale):.2f}, variance: {np.var(Xscale):.2f}")
data mean: 0.00, variance: 582021618208.62
standardized data mean: -0.00, variance: 1.00

Interpretation:

Standardization shifts all features to have a mean of zero and a variance close to one. This ensures that all variables are on the same scale, preventing larger-valued features from dominating the analysis.
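The same center-and-scale recipe used in the cell above can be checked on its own. A self-contained sketch on synthetic stand-in data (the column scales below are hypothetical, chosen to mimic large-valued vs small-valued housing features):

```python
import numpy as np

# Synthetic stand-in for the housing features: one large-scale column,
# one small-scale column (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(loc=[5000.0, 3.0], scale=[2000.0, 1.0], size=(100, 2))

# Same recipe as the notebook cell: subtract column means, divide by
# column standard deviations, guarding against zero-variance columns.
Xc = X - X.mean(axis=0)
std = Xc.std(axis=0)
Xscale = Xc / np.where(std > 0, std, 1)

print(Xscale.mean(axis=0))  # each column's mean is ~0
print(Xscale.std(axis=0))   # each column's std is ~1
```

The `np.where(std > 0, std, 1)` guard matters: a constant column has zero standard deviation, and dividing by it would produce NaNs.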

In [15]:
# do 5-component PCA  (same as MNIST 50 PCA, but dataset smaller)

pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
Xpca = pca.transform(Xscale)

plt.plot(pca.explained_variance_, 'o')
plt.xlabel('PCA component')
plt.ylabel('explained variance')
plt.title('Housing PCA')
plt.show()
[Figure: explained variance per PCA component, "Housing PCA"]

Interpretation:

The PCA explained-variance plot shows how much information each principal component captures from the housing dataset. The first component explains the most variation, and each following component explains less. This helps identify which components are most important for reducing dimensionality while keeping the key patterns in the data.
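A common way to decide how many components to keep is the cumulative `explained_variance_ratio_`. A sketch on synthetic data with one deliberately dominant direction (the data construction here is an assumption, not the housing dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five columns that are all noisy copies of one
# underlying signal, so a single direction dominates (hypothetical).
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(5)])

pca = PCA(n_components=5).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum)  # rises quickly, reaching 1.0 at the last component
```

Reading the cumulative curve against a threshold (say, 90%) tells you how many components are enough to keep most of the variation in the data.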

In [16]:
# plot vs first two PCA components

plt.scatter(Xpca[:,0], Xpca[:,1], c=y, s=20)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.title("Housing Data vs two principal components")
plt.colorbar(label="price")
plt.show()
[Figure: scatter of principal component 1 vs principal component 2, points colored by price]

Interpretation:

The scatter plot of the first two PCA components shows how the houses are spread in the new reduced space. Houses with similar prices appear closer together, meaning the PCA components capture useful patterns related to price.
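One way to check how much information the reduced space keeps is to map the PCA coordinates back with `inverse_transform` and measure the reconstruction error. A sketch on synthetic data (the correlated-column setup below is an assumption standing in for the housing features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 6-feature data with two strongly correlated columns,
# standing in for related housing features (hypothetical).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)

# Project to 2 components, then map back to the original 6 dimensions.
pca = PCA(n_components=2).fit(X)
Xpca = pca.transform(X)
Xback = pca.inverse_transform(Xpca)

err = np.mean((X - Xback) ** 2)
print(f"mean squared reconstruction error with 2 of 6 components: {err:.3f}")
```

A small reconstruction error means the first two components already carry most of the structure, which is what makes the 2-D scatter plot a faithful summary of the full dataset.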