Principal Component Analysis¶
Assignment¶
Data Analysis¶
Lesson: I watched a YouTube video to understand the concept of PCA
What I learned from YouTube¶
- PCA combines factors to produce new uncorrelated ones. But I still couldn't understand PCA. After searching, I found this short video, which was helpful.
This is what I understood from the short video. PCA helps in simplifying complex data. When you have a lot of information, PCA finds the most important patterns and makes the data easier to understand. It keeps the most important information while reducing the number of things you have to look at. This makes the data easier to visualize, faster for computers to use in learning, and helps find the main trends.
For example, if you think of a big messy pile of photos, PCA is like choosing the few pictures that show the most important things, so you don’t have to look at all of them to understand the big picture.
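To make this concrete for myself, here is a tiny toy example I made up (not from the video): three measurements that mostly follow one hidden trend get compressed into a single component that keeps almost all of the variance.
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
t = rng.normal(size=200)  # one hidden "trend" behind all three measurements
X = np.column_stack([t, 2 * t, -t]) + 0.1 * rng.normal(size=(200, 3))
pca = PCA(n_components=1)  # keep just one component
X_1d = pca.fit_transform(X)
# Almost all of the variance survives the 3D -> 1D compression
print(f"Variance kept: {pca.explained_variance_ratio_[0]:.1%}")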
I referred to the code in the curriculum and wrote my own code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Load your dataset
df = pd.read_csv('datasets/climate.csv')
numeric_cols = [
'avg_temperature',
'humidity'
]
# Keep only available columns
numeric_cols = [col for col in numeric_cols if col in df.columns]
X = df[numeric_cols].values
print(f"Data shape (rows, features): {X.shape}")
# Standardise the data
print(f"Original mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")
X = X - np.mean(X, axis=0)
std = np.std(X, axis=0)
X_scaled = X / np.where(std > 0, std, 1)
print(f"Standardized mean: {np.mean(X_scaled):.2f}, variance: {np.var(X_scaled):.2f}")
# Perform PCA (choose number of components)
pca = PCA(n_components=2) # or more if you want
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
# Plot explained variance
plt.figure(figsize=(6,4))
plt.plot(pca.explained_variance_, 'o-')
plt.xlabel('PCA component')
plt.ylabel('Explained variance')
plt.title('PCA Variance Explained')
plt.show()
# Scatter plot of first 2 PCA components
plt.figure(figsize=(6,5))
plt.scatter(X_pca[:,0], X_pca[:,1], s=5, alpha=0.6)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Climate/Energy Data in PCA Space")
plt.show()
Data shape (rows, features): (36540, 2)
Original mean: 36.78, variance: 738.49
Standardized mean: 0.00, variance: 1.00
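A side note to myself: the manual centre-and-scale above should be equivalent to scikit-learn's StandardScaler. A minimal sketch, assuming the same df, numeric_cols, and np from the cell above:
from sklearn.preprocessing import StandardScaler
# StandardScaler centres each column and divides by its standard deviation,
# the same as the manual code above
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[numeric_cols].values)
print(f"Standardized mean: {np.mean(X_scaled):.2f}, variance: {np.var(X_scaled):.2f}")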
My graph looks really bad, so I want to find a way to make it look better. ChatGPT prompt:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
df = pd.read_csv('datasets/climate.csv')
# Select numeric columns for PCA
numeric_cols = [
'avg_temperature',
'humidity'
]
# Keep only available columns
numeric_cols = [col for col in numeric_cols if col in df.columns]
X = df[numeric_cols].values
print(f"Data shape (rows, features): {X.shape}")
# Standardize the data
print(f"Original mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")
X = X - np.mean(X, axis=0)
std = np.std(X, axis=0)
X_scaled = X / np.where(std > 0, std, 1)
print(f"Standardized mean: {np.mean(X_scaled):.2f}, variance: {np.var(X_scaled):.2f}")
# -------------------------------
# Perform PCA (2 components)
# -------------------------------
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
# -------------------------------
# Scatter plot is optional
# -------------------------------
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], s=5, alpha=0.4)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Scatter Plot (Optional)")
plt.show()
# -------------------------------
# 2D Kernel Density Estimation (KDE) plot
# -------------------------------
plt.figure(figsize=(10,8))
sns.kdeplot(
x=X_pca[:,0],
y=X_pca[:,1],
fill=True, # Fill under the density
cmap="viridis", # Color map
thresh=0.05, # Ignore very low density areas
levels=100 # Number of contour levels
)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA 2D Density Plot (KDE)")
plt.show()
Data shape (rows, features): (36540, 2)
Original mean: 36.78, variance: 738.49
Standardized mean: 0.00, variance: 1.00
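KDE over 36,540 points can be slow to render; a cheaper alternative I found is matplotlib's hexbin, which counts points per cell instead of smoothing them. A sketch, assuming X_pca and plt from the cell above:
# Hexagonal binning: each cell's colour is the number of points inside it
plt.figure(figsize=(8,6))
plt.hexbin(X_pca[:,0], X_pca[:,1], gridsize=40, cmap='viridis')
plt.colorbar(label='Points per bin')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA 2D Density (hexbin)")
plt.show()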
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# -------------------------------
# Load your dataset
# -------------------------------
df = pd.read_csv('datasets/climate.csv')
# Select numeric columns
numeric_cols = ['avg_temperature', 'humidity']
numeric_cols = [col for col in numeric_cols if col in df.columns]
X = df[numeric_cols].values
print(f"Data shape (rows, features): {X.shape}")
# -------------------------------
# Standardize the data
# -------------------------------
X = X - np.mean(X, axis=0)
std = np.std(X, axis=0)
X_scaled = X / np.where(std > 0, std, 1)
# -------------------------------
# Perform PCA
# -------------------------------
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
# -------------------------------
# Explained variance plot
# -------------------------------
plt.figure(figsize=(6,4))
plt.plot(pca.explained_variance_, 'o-')
plt.xlabel('PCA component')
plt.ylabel('Explained variance')
plt.title('PCA Variance Explained')
plt.show()
# -------------------------------
# Scatter plot colored by avg_temperature
# -------------------------------
plt.figure(figsize=(8,6))
plt.scatter(
X_pca[:,0], X_pca[:,1],
c=df['avg_temperature'], # color by avg_temperature
cmap='hot', # 'hot' colormap: higher temp = brighter (yellow/white)
s=30,
alpha=0.7
)
plt.colorbar(label='avg_temperature')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA colored by avg_temperature")
plt.show()
# -------------------------------
# Scatter plot colored by humidity
# -------------------------------
plt.figure(figsize=(8,6))
plt.scatter(
X_pca[:,0], X_pca[:,1],
c=df['humidity'], # color by humidity
cmap='Blues', # higher humidity = darker blue
s=30,
alpha=0.7
)
plt.colorbar(label='humidity')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA colored by humidity")
plt.show()
Data shape (rows, features): (36540, 2)
I wanted to make the PCA scatter plot easier to interpret, so I color-coded the points based on the intensity of avg_temperature and humidity using ChatGPT. In the 'hot' colormap, higher temperature values appear brighter (yellow/white) and lower temperatures darker. Higher humidity values are shown with darker blue and lower humidity with lighter blue. This helps visualize how temperature and humidity vary across the PCA space.
c=df['avg_temperature'] → colors points according to temperature values.
cmap='hot' → hotter temperatures appear brighter (yellow/white).
c=df['humidity'] → colors points according to humidity values.
cmap='Blues' → higher humidity is darker blue.
alpha=0.7 → makes points slightly transparent so overlapping points are visible.
plt.colorbar() → adds a legend showing what the colors mean.
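To compare the two colourings at a glance, both panels could also share one figure. A sketch, assuming X_pca, df, and plt from the cells above:
# Two panels sharing the PCA axes, one per colouring variable
fig, axes = plt.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)
for ax, col, cmap in [(axes[0], 'avg_temperature', 'hot'), (axes[1], 'humidity', 'Blues')]:
    sc = ax.scatter(X_pca[:,0], X_pca[:,1], c=df[col], cmap=cmap, s=10, alpha=0.7)
    fig.colorbar(sc, ax=ax, label=col)
    ax.set_xlabel("Principal Component 1")
    ax.set_title(f"PCA colored by {col}")
axes[0].set_ylabel("Principal Component 2")
plt.show()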
I don't understand anything in the graphs above, so I used ChatGPT to find the most important parameter among the 8 parameters in my dataset.
Code to find the Principal Component¶
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
df = pd.read_csv('datasets/climate.csv')
# Automatically select numeric columns for PCA
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
X = df[numeric_cols].values
print(f"Using numeric columns for PCA:\n{numeric_cols}")
print(f"Data shape (rows, features): {X.shape}")
# -------------------------------
# Standardize the data (zero mean, unit variance)
# -------------------------------
X_centered = X - np.mean(X, axis=0)
std = np.std(X_centered, axis=0)
X_scaled = X_centered / np.where(std > 0, std, 1)
# -------------------------------
# Compute PCA
# -------------------------------
pca = PCA()
pca.fit(X_scaled)
Xpca = pca.transform(X_scaled)
# -------------------------------
# Output results
# -------------------------------
print("Explained variance for each principal component:")
print(pca.explained_variance_)
print("\nExplained variance ratio for each principal component:")
print(pca.explained_variance_ratio_)
print("\nFirst 5 principal components for first 5 samples:")
print(Xpca[:5,:])
Using numeric columns for PCA:
['avg_temperature', 'humidity', 'co2_emission', 'energy_consumption', 'renewable_share', 'urban_population', 'industrial_activity_index', 'energy_price']
Data shape (rows, features): (36540, 8)
Explained variance for each principal component:
[1.17205956 1.01840972 1.00704672 1.00523431 0.99815784 0.98884793 0.98350078 0.82696208]

Explained variance ratio for each principal component:
[0.14650344 0.12729773 0.12587739 0.12565085 0.12476632 0.12360261 0.12293423 0.10336743]

All principal components for the first 5 samples:
[[ 0.08056127  0.03400484 -1.4512695  -0.36764448 -1.25677393  0.97914646 -1.35808172  1.48017631]
 [-0.11680034  2.18353986 -1.83227385  0.54884853 -0.91302136  0.1149289  -0.53072091 -1.03073534]
 [-1.06148618 -0.16435237 -0.21453156  1.11513327 -1.28541889 -0.48114672 -1.45344164  0.07112574]
 [-1.67319757  0.72558716 -0.00603048  0.11401061 -1.59733427  1.07827485  1.49014774  0.21943478]
 [-0.18283688  0.60252042 -2.08231014  0.54210842 -0.77058958  0.15063314  1.36094491  0.11424932]]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# -------------------------------
# Load dataset
# -------------------------------
df = pd.read_csv('datasets/climate.csv')
# Select numeric columns automatically
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
X = df[numeric_cols].values
# Standardize data
X_centered = X - np.mean(X, axis=0)
std = np.std(X_centered, axis=0)
X_scaled = X_centered / np.where(std > 0, std, 1)
# -------------------------------
# Perform PCA
# -------------------------------
pca = PCA()
pca.fit(X_scaled)
# -------------------------------
# Plot explained variance ratio vs components
# -------------------------------
plt.figure(figsize=(8,5))
components = np.arange(1, len(pca.explained_variance_ratio_)+1)
plt.bar(components, pca.explained_variance_ratio_, alpha=0.7, color='skyblue')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), marker='o', color='red', label='Cumulative')
plt.xlabel('PCA Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio by PCA Component')
plt.xticks(components)
plt.legend()
plt.grid(True)
plt.show()
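Instead of eyeballing the cumulative curve, scikit-learn can also pick the number of components for me: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of variance. A sketch, assuming X_scaled and PCA from the cell above:
# Keep as many components as needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full')  # 'full' solver supports float n_components
pca_95.fit(X_scaled)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
print(f"Variance actually explained: {pca_95.explained_variance_ratio_.sum():.1%}")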
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# Load dataset
df = pd.read_csv('datasets/climate.csv')
# Select numeric columns automatically
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
X = df[numeric_cols].values
# Standardize data
X_centered = X - np.mean(X, axis=0)
std = np.std(X_centered, axis=0)
X_scaled = X_centered / np.where(std > 0, std, 1)
# Perform PCA
pca = PCA()
pca.fit(X_scaled)
# Find the most important PC
explained_variance = pca.explained_variance_ratio_
most_important_pc_index = np.argmax(explained_variance) + 1 # +1 for 1-based indexing; sklearn sorts PCs by variance, so this is always PC1
most_important_pc_variance = explained_variance[most_important_pc_index-1]
print(f"The most important principal component is PC{most_important_pc_index}")
print(f"It explains {most_important_pc_variance*100:.2f}% of the total variance")
The most important principal component is PC1
It explains 14.65% of the total variance
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# -------------------------------
# Load dataset
# -------------------------------
df = pd.read_csv('datasets/climate.csv')
# Automatically select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
X = df[numeric_cols].values
# Standardize data
X_centered = X - np.mean(X, axis=0)
std = np.std(X_centered, axis=0)
X_scaled = X_centered / np.where(std > 0, std, 1)
# -------------------------------
# Perform PCA
# -------------------------------
pca = PCA()
pca.fit(X_scaled)
# -------------------------------
# Print explained variance for all PCs
# -------------------------------
explained_variance = pca.explained_variance_ratio_
for i, var in enumerate(explained_variance, start=1):
print(f"PC{i} explains {var*100:.2f}% of the variance")
# -------------------------------
# Find and print the most important PC
# -------------------------------
most_important_pc_index = np.argmax(explained_variance) + 1 # +1 for 1-based indexing; sklearn sorts PCs by variance, so this is always PC1
most_important_pc_variance = explained_variance[most_important_pc_index-1]
print("\n==============================")
print(f"The most important principal component is PC{most_important_pc_index}")
print(f"It explains {most_important_pc_variance*100:.2f}% of the total variance")
print("==============================")
PC1 explains 14.65% of the variance
PC2 explains 12.73% of the variance
PC3 explains 12.59% of the variance
PC4 explains 12.57% of the variance
PC5 explains 12.48% of the variance
PC6 explains 12.36% of the variance
PC7 explains 12.29% of the variance
PC8 explains 10.34% of the variance

==============================
The most important principal component is PC1
It explains 14.65% of the total variance
==============================
In my dataset, the explained variance is spread fairly evenly, so most of the components are required to make a good prediction model. PC1 has the highest explained variance (14.65%), which means a change along PC1 will affect my data the most. Note that PC1 is a weighted combination of all eight features, not avg_temperature by itself; its loadings show which features drive it, as in the sketch below.
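A minimal sketch to inspect the PC1 loadings, assuming pca, pd, and numeric_cols from the cells above:
# Each row of pca.components_ holds one PC's weights on the original features
loadings = pd.Series(pca.components_[0], index=numeric_cols)
# Sort by absolute weight: the largest entries drive PC1 the most
print(loadings.reindex(loadings.abs().sort_values(ascending=False).index))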