Kelzang Tobgyel - Fab Futures - Data Science

Assignment 7: Transformation

Learning about PCA and how it works: I have learnt the following information from Gemini

Principal Component Analysis (PCA) is an unsupervised machine learning technique primarily used for dimensionality reduction. Its main goal is to transform a dataset with a large number of correlated variables (features) into a new, smaller set of uncorrelated variables called Principal Components (PCs), while retaining most of the variability (or information) present in the original data.

1. What is PCA?

Imagine you have a 3-dimensional cloud of data points (like the salary data with features like salary, age, and remote ratio). If this cloud is relatively flat—meaning the points mostly lie on a plane within that 3D space—you don't truly need three dimensions to describe the data's variation.

PCA finds the directions (or axes) of maximum variance in the data.

Core Concepts:

Principal Components (PCs): These are the new axes.

PC1 is the axis along which the data varies the most.

PC2 is the second most important axis, orthogonal (perpendicular) to PC1, capturing the next highest amount of remaining variance.

This continues until all dimensions are accounted for.

Dimensionality Reduction: By keeping only the first $k$ Principal Components (where $k$ is much smaller than the original number of features), we effectively project the high-dimensional data onto a lower-dimensional subspace, simplifying the data without losing much information.
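
To make this concrete, here is a minimal sketch in Python (my own illustration on synthetic 3D data, not the salary dataset): PCA ranks the new axes by variance, and keeping only $k=2$ components projects the nearly flat cloud onto its plane.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3D points that mostly lie on a flat plane inside 3D space
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1], 0.05 * rng.normal(size=200)])

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)   # first two PCs carry almost all the variance

# Keeping only k=2 components projects the cloud onto that plane
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                      # (200, 2)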

2. Advantages for Understanding Data

Using PCA provides several powerful advantages when exploring and preparing data, especially high-dimensional datasets like the one with many job titles and locations.

A. Data Visualization

The Problem: Humans cannot visualize data in more than three dimensions.

The Solution: PCA allows you to reduce a dataset with hundreds of features down to just two or three principal components (PC1 and PC2).

Benefit: You can then plot the data in a simple 2D or 3D scatter plot. This is crucial for exploratory data analysis (EDA), as it often reveals hidden clusters, outliers, or separation between different classes (e.g., seeing how "Executive" level salaries separate from "Entry" level salaries in your data).
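
As a small illustration (using sklearn's built-in digits dataset as a stand-in, since my salary data still needs the preprocessing described later), a dataset with 64 features can be squeezed into two components and shown in a single scatter plot:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)                 # 64 pixel features per image
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Digits dataset projected onto the first two principal components')
plt.colorbar(label='Digit class')
plt.show()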

B. Feature Interpretation and Loadings

The Problem: In a complex model, it's hard to tell which combination of features is driving a particular outcome.

The Solution: Each Principal Component is a linear combination (a weighted sum) of the original features. The weight assigned to each original feature is called its loading.

Benefit: By examining the loadings of the top components (e.g., PC1), you can determine which original variables (like salary_in_usd, experience_level_EX, or job_title_Data Scientist) are most responsible for the largest pattern of variance in your dataset.
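
A minimal sketch of reading loadings, using a hypothetical three-column table of random numbers rather than my real preprocessed features: each row of components_ holds one PC's weights on the original columns.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns standing in for the real preprocessed features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'salary_in_usd': rng.normal(120_000, 30_000, 300),
    'Age':           rng.normal(35, 8, 300),
    'remote_ratio':  rng.choice([0, 50, 100], 300),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Each row of components_ holds one PC's loadings on the original columns
pc1_loadings = pd.Series(pca.components_[0], index=toy.columns)
print(pc1_loadings.sort_values(key=abs, ascending=False))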

C. Removing Noise and Redundancy

The Problem: Real-world datasets often have noisy, correlated, or redundant features that can confuse a modeling algorithm.

The Solution: PCA isolates the most informative axes (the first few PCs) and discards the components that contain very little variance. These low-variance components are often associated with random noise or minor fluctuations.

Benefit: By using only the top $k$ components, you are essentially performing noise reduction, which can significantly improve the performance and robustness of subsequent machine learning models.
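
A quick way to see the noise-reduction idea, sketched on sklearn's built-in digits dataset as a stand-in: project noisy data onto the top components and reconstruct it with inverse_transform, which discards the low-variance directions where most of the noise lives.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=2.0, size=X.shape)   # add random pixel noise

# Keep only the top 15 of 64 directions, then reconstruct
pca = PCA(n_components=15).fit(X_noisy)
X_denoised = pca.inverse_transform(pca.transform(X_noisy))

print(f"Kept {pca.n_components_} components covering "
      f"{pca.explained_variance_ratio_.sum():.0%} of the variance")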

D. Solving Multicollinearity

The Problem: Multicollinearity occurs when multiple features in your dataset are highly correlated (e.g., if "Mid-Term Score" and "Final Exam Score" are both excellent predictors of "Overall Grade"). This can destabilize models like linear regression.

The Solution: The Principal Components are, by definition, uncorrelated with one another.

Benefit: Using the PCs as input features for a model eliminates multicollinearity issues entirely, leading to more stable and interpretable results.
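
This can be verified directly: after the transform, the correlation matrix of the PC scores is (numerically) the identity. A small sketch with deliberately correlated random features:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Two nearly identical (highly correlated) features plus one independent one
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

print(np.corrcoef(X, rowvar=False).round(2))        # strong off-diagonal correlation
scores = PCA(n_components=3).fit_transform(X)
print(np.corrcoef(scores, rowvar=False).round(2))   # essentially the identity matrix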

E. Improving Model Efficiency

The Problem: Training machine learning models (especially complex ones like neural networks) on high-dimensional data is computationally expensive and slow.

The Solution: Dimensionality reduction via PCA drastically reduces the number of features.

Benefit: This speeds up the training process, requires less memory, and helps prevent the "curse of dimensionality" (where models struggle to find patterns in sparsely filled high-dimensional space).
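
A rough sketch of the speed-up, using the digits dataset and logistic regression as stand-ins for my data and an eventual model: the same classifier is fitted on all 64 features and on 10 principal components, and the fit times are compared.

import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)

for name, features in [('all 64 pixels', X_scaled), ('10 principal components', X_reduced)]:
    start = time.perf_counter()
    LogisticRegression(max_iter=2000).fit(features, y)
    print(f"{name}: fitted in {time.perf_counter() - start:.2f} s")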

Understanding the Professor's code with help from ChatGPT

I have understood that the Professor's code on PCA was divided into six major segments (a rough sketch of these steps follows the list):

  1. Loading the data from MNIST
  2. Initial Visualization (Pixel Space): the first plot shows the digit colors (0-9) completely mixed, which means arbitrary pixels do not contain enough information to separate the different digits.
  3. Data Standardization (Preprocessing)
  4. Applying Principal Component Analysis (PCA) from sklearn library
  5. Analyzing Explained Variance: the plot typically shows that the first few components (PC1, PC2) capture a large share of the total variance and the line quickly flattens, confirming that 50 components are sufficient to summarize most of the data's complexity.
  6. Final Visualization (PCA Space): unlike the initial plot using random pixels, this visualization shows that the different colored digits (0-9) are clearly grouped and separated. PCA successfully reduces the 784 pixel dimensions down to just 2 meaningful dimensions (PC1 and PC2) that preserve the most useful discriminatory information.
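
Here is my own rough sketch of those six segments (an approximation, not the Professor's actual code; I am assuming MNIST can be fetched with sklearn's fetch_openml('mnist_784'), which may differ from how his notebook loads the data):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Load MNIST (784 pixel features per image); subsample to keep it quick
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X, y = X[:5000], y[:5000].astype(int)

# 2. Initial visualization: two arbitrary pixels show no class separation
plt.scatter(X[:, 350], X[:, 400], c=y, cmap='tab10', s=5)
plt.title('Two arbitrary pixels: digit classes are completely mixed')
plt.show()

# 3. Standardize, then 4. fit PCA with 50 components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=50).fit(X_scaled)
X_pca = pca.transform(X_scaled)

# 5. Explained variance: the cumulative curve rises fast, then flattens
plt.plot(range(1, 51), pca.explained_variance_ratio_.cumsum(), marker='.')
plt.xlabel('Number of components')
plt.ylabel('Cumulative variance explained')
plt.show()

# 6. Final visualization: PC1 vs PC2 now shows visible digit groupings
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()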

My data transformation analysis: Explaining the most variance in salary

From doing this exercise, I have learnt that the PCA algorithm works best with quantitative data. Since my data was a mixed collection, I had to go through the following process before applying PCA. This was the recommendation from Gemini.

  1. Cleaning: Handle missing values and convert the categorical features (like job_title and experience_level) into numerical form using One-Hot Encoding.

  2. Standardizing: Scale the numerical data so that no single feature (like salary_in_usd) dominates the analysis.

  3. Applying PCA: Run PCA, analyze the explained variance, and visualize the data in the 2D Principal Component space.

In [1]:
#Code generated from Gemini
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# 1. Load the dataset
try:
    df = pd.read_csv('datasets/Dataset salary 2024.csv')
except FileNotFoundError:
    print("Error: 'Dataset salary 2024.csv' not found. Please ensure the file is correctly uploaded.")
    exit()

# Data cleaning and preprocessing
# Filter out the row with missing Age (which is represented by a grave accent `)
df = df[df['Age'] != '`']

# Convert Age to numeric, forcing errors to NaN if any new non-numeric values appear
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Drop rows with any remaining missing values for simplicity in this analysis
df.dropna(inplace=True)

# Define the feature groups for the PCA (PCA is unsupervised, so no target is used)
# 'salary_in_usd', 'Age' and 'remote_ratio' are the numerical features to analyze
numerical_features = ['salary_in_usd', 'Age', 'remote_ratio']
categorical_features = ['experience_level', 'employment_type', 'job_title',
                        'employee_residence', 'company_location', 'company_size']

# Select the feature columns for the analysis (PCA is unsupervised, so no target is needed)
X = df[numerical_features + categorical_features].copy()

# 2. Preprocessing Pipeline (Standardization and One-Hot Encoding)
# We use a ColumnTransformer to apply different preprocessing steps to different columns

# Create transformers for numerical and categorical data
numerical_transformer = StandardScaler() 
categorical_transformer = OneHotEncoder(handle_unknown='ignore') # Converting categories to numbers

# Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop' # Dropping any columns not specified above
)

# 3. Creating the PCA Pipeline
n_components_to_fit = 10
pca_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=n_components_to_fit))
])

# Fit the pipeline to the data (this transforms and fits the PCA model)
X_pca_transformed = pca_pipeline.fit_transform(X)

# Extract the fitted PCA model and the variance data
pca_model = pca_pipeline['pca']
explained_variance_ratio = pca_model.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# 4. Visualization of Explained Variance (Scree Plot)
plt.figure(figsize=(10, 5))
plt.plot(range(1, n_components_to_fit + 1), explained_variance_ratio, marker='o', linestyle='--', label='Individual Variance')
plt.plot(range(1, n_components_to_fit + 1), cumulative_variance, marker='o', linestyle='-', color='red', label='Cumulative Variance')
plt.title('Explained Variance by Principal Component')
plt.xlabel('Principal Component Number')
plt.ylabel('Proportion of Variance Explained')
plt.xticks(range(1, n_components_to_fit + 1))
plt.grid(True)
plt.legend()
plt.show()
print(f"\nCumulative Variance Explained by First {n_components_to_fit} Components: {cumulative_variance[-1]*100:.2f}%")

# 5. Visualization of Data in the First Two Principal Components (2D Plot)
# Use 'experience_level' to color-code the plot to see if PC1/PC2 separate experience groups
experience_colors = df['experience_level'].astype('category').cat.codes

plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    X_pca_transformed[:, 0], # First Principal Component (PC1)
    X_pca_transformed[:, 1], # Second Principal Component (PC2)
    c=experience_colors,
    cmap='viridis',
    s=15,
    alpha=0.6
)
plt.title('Salary Data Projected onto PC1 and PC2')
plt.xlabel(f'Principal Component 1 ({explained_variance_ratio[0]*100:.2f}% of Variance)')
plt.ylabel(f'Principal Component 2 ({explained_variance_ratio[1]*100:.2f}% of Variance)')

# Add a legend for the color coding
legend1 = plt.legend(*scatter.legend_elements(),
                    loc="lower left", title="Experience Level")
plt.gca().add_artist(legend1)

plt.grid(True, linestyle='--')
plt.show()

# 6. Interpretation of the First Component (Optional but informative)
# The PCA component weights indicate which original features contribute most to that component.
# This part requires accessing the feature names after One-Hot Encoding.

# Get the feature names after preprocessing
feature_names = list(preprocessor.named_transformers_['num'].get_feature_names_out(numerical_features))
feature_names.extend(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))

# Get the weights (loadings) of the first component
loadings = pca_model.components_[0]

# Create a series of features and their loadings for PC1
pc1_loadings = pd.Series(loadings, index=feature_names).sort_values(ascending=False)

print("\n--- Top 5 Features Contributing to Principal Component 1 ---")
print(pc1_loadings.head(5))
print("\n--- Bottom 5 Features Contributing to Principal Component 1 ---")
print(pc1_loadings.tail(5))
[Figure: scree plot of individual and cumulative explained variance for the first 10 principal components]
Cumulative Variance Explained by First 10 Components: 87.32%
[Figure: salary data projected onto PC1 and PC2, colored by experience level]
--- Top 5 Features Contributing to Principal Component 1 ---
salary_in_usd                          0.844742
experience_level_SE                    0.148562
employee_residence_US                  0.112006
company_location_US                    0.107617
job_title_Machine Learning Engineer    0.058138
dtype: float64

--- Bottom 5 Features Contributing to Principal Component 1 ---
experience_level_EN      -0.064008
job_title_Data Analyst   -0.087230
experience_level_MI      -0.097495
Age                      -0.098933
remote_ratio             -0.447200
dtype: float64

Analysis

The PCA model effectively distills the core variance of the dataset into two meaningful dimensions:

  • PC1 is the dimension of Career Advancement: it directly reflects the expected progression of salary and age associated with climbing the corporate ladder.
  • PC2 is the dimension of Work Arrangement: it cleanly isolates the impact of remote_ratio from compensation and age, providing a clear vertical separation between work models across all experience levels (the PC2 loadings can be inspected with the short snippet below).

The separation of experience levels along the PC1 axis is very strong, indicating that compensation and age are tightly linked and highly predictive of an individual's career standing within this dataset.
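
A minimal check of the PC2 reading, assuming pca_model and feature_names from the code above are still defined in the notebook:

# Assumes pca_model and feature_names from the fitted pipeline above are still in memory
pc2_loadings = pd.Series(pca_model.components_[1], index=feature_names)
print(pc2_loadings.sort_values(key=abs, ascending=False).head(5))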