Jigme Tenzin - Fab Futures - Data Science

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

In simpler terms, PCA is a technique used to reduce the number of dimensions (features) in a dataset while retaining as much of the original variability (information) as possible.
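To make the definition concrete, here is a minimal sketch of PCA done by hand with NumPy. It assumes a small synthetic dataset (the variable names and the data are illustrative, not taken from the notebook) and shows the two properties described above: the transformation is orthogonal, and the resulting components are uncorrelated.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is strongly correlated with the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.5, size=200),
    rng.normal(size=200),
])

# Center the data, then eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending
# so the first component carries the most variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the eigenvectors (the principal components).
scores = centered @ eigvecs

# The covariance of the projected data should be (numerically) diagonal.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

# Fraction of total variance kept by the first two components.
retained = eigvals[:2].sum() / eigvals.sum()

print("Largest off-diagonal covariance:", np.abs(off_diag).max())
print("Variance retained by 2 components:", retained)
```

The eigenvectors form an orthogonal matrix, so the projection is a rotation of the centered data rather than a distortion of it. Dropping the last column of `scores` is the dimensionality reduction step.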

In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('datasets/StudentsPerformance.csv')

# Select numeric columns only
numeric_df = df.select_dtypes(include='number')

# Standardize numeric data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_df)

# Apply PCA (2 components)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Add PCA results back to a dataframe
pca_df = pd.DataFrame({
    'PC1': pca_result[:, 0],
    'PC2': pca_result[:, 1]
})

print(pca_df.head())

# Plot PCA scatter
plt.figure(figsize=(6, 5))
plt.scatter(pca_df['PC1'], pca_df['PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Scatter Plot')
plt.show()
        PC1       PC2
0  0.560514  0.088285
1  1.719201 -0.910745
2  2.883135 -0.021999
3 -2.119921 -0.074994
4  0.988094  0.131914
[Figure: scatter plot titled "PCA Scatter Plot", with PC1 on the x-axis and PC2 on the y-axis]
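A PC1 versus PC2 scatter does not by itself say how much of the original variability the two components retain. One way to check is scikit-learn's `explained_variance_ratio_` attribute on a fitted `PCA` object. The sketch below uses synthetic correlated columns as a stand-in, since the contents of the CSV are not shown here; the numbers it prints say nothing about the StudentsPerformance data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in (an assumption): three columns that share a common
# underlying factor plus independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 1))
features = np.hstack(
    [base + rng.normal(scale=0.4, size=(300, 1)) for _ in range(3)]
)

# Same preprocessing as in the cell above: standardize, then fit PCA.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2).fit(scaled)

# Share of total variance captured by each component, and the running total.
ratio = pca.explained_variance_ratio_
cumulative = ratio.cumsum()

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

In the notebook itself, the same attribute is available on the existing `pca` object after `fit_transform`, so `pca.explained_variance_ratio_` could be printed directly without refitting.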