Assignment 6: Density Estimation¶
Goal of the session: Estimate the data probability distribution.¶
The classes on Density Estimation and Probability were each more daunting than the last. I personally felt overwhelmed by all the mathematical equations, but it is interesting and thrilling to step outside the comfort zone once in a while. So, to understand the class in more depth, I did the following information search on the internet.
- E-M is an algorithm used to find maximum likelihood estimates for models that depend on unobserved (latent) variables. Think of it as a smart guess-and-check process.
- Circular Logic (The Smart Guess): It seems circular because you don't know the best Model (e.g., the center of a cluster) and you don't know which Data belongs to which Model (the latent variables). E-M handles this by making a guess and refining it:
E-Step (Expectation): Use the current guess of the model parameters to figure out the most likely values for the Latent Variables (e.g., "Which cluster does this data point probably belong to?").
M-Step (Maximization): Use these newly assigned latent variables to update the Model Parameters (e.g., "Now that we've assigned the points, let's move the cluster center to the average of those points").
Iteration Maximizes Likelihood: By repeating the E- and M-steps, the algorithm keeps improving its estimates, getting closer to the best possible model that explains the data (maximizing the likelihood).
Local Maxima & Momentum: E-M can sometimes get stuck at a "good enough" solution (a local maximum of the likelihood) instead of the best solution (the global maximum). Momentum is a technique that can be added to help the algorithm push past these good-enough spots, giving it a better chance of finding the optimal result. (A small numerical sketch of the E- and M-steps follows this list.)
- Clustering is the task of grouping a set of data points such that points in the same group (cluster) are more similar to each other than to those in other groups. It's like sorting a basket of mixed fruit by type.
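To make the E-step/M-step loop concrete for myself, below is a minimal numerical sketch of E-M for a mixture of two 1-D Gaussians, written with numpy only. The data and starting values are invented for illustration; in practice a library routine such as sklearn.mixture.GaussianMixture would be used instead.
import numpy as np
rng = np.random.default_rng(0)
# Toy 1-D data: two overlapping groups (values invented for illustration)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
# Initial guesses for the model parameters of the two components
mu = np.array([1.0, 4.0])      # means
sigma = np.array([1.0, 1.0])   # standard deviations
pi = np.array([0.5, 0.5])      # mixing weights
def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
for _ in range(50):
    # E-Step: responsibility of each component for each point (the latent variables)
    r = pi * np.stack([gauss(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-Step: update means, spreads and mixing weights from the responsibilities
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
print(mu, sigma, pi)   # the means should end up near 0 and 5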
Analogy used to understand the Expectation-Maximization algorithm¶
Imagine you have a large, unlabeled box containing a mix of pennies and nickels.
The Problem: You want to figure out the average weight of a penny and the average weight of a nickel, but you can't tell them apart just by looking; you only have a scale, and you don't know exactly which coin is which.
The Two Unknowns (Circular Logic):
- The Model Parameters (The Coin Averages): You don't know the true average weight of a penny ($P_{avg}$) or the true average weight of a nickel ($N_{avg}$).
- The Latent Variables (Which Coin is Which): You don't know whether Coin A is a penny or a nickel.
Since you can't calculate the average weights until you know which coins belong to which group, and you can't assign the coins to groups until you know the average weights, you have a circular-logic problem.
How E-M Solves It: You start with a random guess for the average weights (your initial "model").
E-Step (Expectation: Assigning the Coins)
- Action: You take one coin, put it on the scale, and measure its weight.
- Decision: You use your current guess of the average penny weight and average nickel weight to decide which coin it is most likely to be.
- Example: If your current guess for a penny is 2.5 grams and for a nickel is 5 grams, and the coin weighs 2.6 grams, you expect it to be a penny.
The E-Step is: using the Model to guess the Latent Variables (group membership).
M-Step (Maximization: Recalculating the Averages)
- Action: Once you've measured all the coins and assigned them to a temporary "penny pile" or "nickel pile," you have two groups.
- Recalculation: You take all the coins in the "penny pile" and calculate a new, better average weight for the penny. You do the same for the "nickel pile."
The M-Step is: using the new Latent Variables (the piles) to update the Model Parameters (the averages).
Iteration (Refining the Guess)
You repeat the entire process:
- You use the new, improved average weights from the M-Step to re-assign all the coins in a new E-Step.
- This changes the makeup of the piles, leading to a new, better M-Step.
By iterating, your guessed average weights for pennies and nickels get closer and closer to the true average weights, and your assignments (the latent variables) become more accurate.
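Turning the analogy into code for myself: a minimal sketch, assuming made-up coin weights and a simple hard assignment of each coin to the nearest guessed average (so it behaves like 1-D $k$-means rather than the soft version above).
import numpy as np
# Hypothetical coin weights in grams (made up for illustration)
weights = np.array([2.4, 2.6, 2.5, 5.1, 4.9, 2.7, 5.0, 4.8])
# Initial guesses for the model parameters (average penny / nickel weight)
penny_avg, nickel_avg = 3.0, 4.0
for step in range(10):
    # E-Step: put each coin in the pile whose current average weight is closer
    is_penny = np.abs(weights - penny_avg) < np.abs(weights - nickel_avg)
    # M-Step: recompute each pile's average from the coins assigned to it
    penny_avg = weights[is_penny].mean()
    nickel_avg = weights[~is_penny].mean()
print(penny_avg, nickel_avg)   # settles near 2.5 g and 5.0 g for this toy data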
Working principles of the $k$-means clustering algorithm¶
$k$-means is one of the most common and simplest clustering algorithms, and it directly applies the E-M idea.
Initialization: We start by randomly choosing $k$ anchor points (or centroids) from the data. These are the initial "guesses" for the center of each cluster.
Iteration (E-M Steps):
- E-Step (Assign Data Points): Each data point is assigned to the closest anchor. This identifies the latent variable (which cluster the point belongs to).
- M-Step (Update Anchor): Each anchor is moved to the mean (average) position of all the data points just assigned to it, which is the best position for the center given those points. (A small from-scratch sketch of these two steps appears after the elbow discussion below.)
Assessing $k$ (The Elbow Method): To figure out the optimal number of clusters ($k$), you plot the Total Distance of points to their anchors (called the within-cluster sum of squares) against the number of clusters you tried. The optimal $k$ is usually found where this line starts to bend sharply, looking like an "elbow."
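Before using the library, here is a minimal from-scratch sketch of the assign/update loop on invented 2-D blob data (my own illustration; the sklearn KMeans used below additionally handles smarter initialization and edge cases such as empty clusters).
import numpy as np
rng = np.random.default_rng(42)
# Toy 2-D data: three invented blobs
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 4], [0, 5])])
k = 3
# Initialization: pick k random data points as the starting anchors (centroids)
centroids = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    # E-Step: assign each point to its closest anchor
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # M-Step: move each anchor to the mean of the points assigned to it
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break   # converged: the anchors stopped moving
    centroids = new_centroids
# WCSS (inertia): total squared distance of points to their anchors, used by the elbow method
wcss = ((X - centroids[labels]) ** 2).sum()
print(centroids.round(2), round(wcss, 1))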
Assignment: $k$-means clustering of Age vs. Salary¶
Identifying the most suitable value of $k$ using the elbow method, which returned a value of 3.
# Code generated by ChatGPT. I am not yet familiar with all the methods and functions of these libraries, but I have understood the overall concepts of k-means clustering.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv("datasets/Dataset salary 2024.csv")
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Drop rows with NaN in 'Age' or 'salary_in_usd' (only 'Age' is expected to have NaNs now)
df_cleaned = df.dropna(subset=['Age', 'salary_in_usd']).copy()
# Select features
X = df_cleaned[['Age', 'salary_in_usd']]
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# -----------------
# 1. Elbow Method
# -----------------
wcss = []
# Test k from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
# Plot the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow for suitable $k$')
plt.xlabel('Number of Clusters ($k$)')
plt.ylabel('WCSS') # Within-Cluster Sum of Squares
plt.grid(True)
plt.xticks(range(1, 11))
plt.show()
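As an extra cross-check on the elbow reading (my own addition, not part of the generated code), the silhouette score from sklearn can be computed over the same range of $k$; higher values indicate better-separated clusters.
from sklearn.metrics import silhouette_score
# Silhouette needs at least 2 clusters, so start at k=2
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))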
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# --- Re-run Data Preparation and Clustering to ensure data is available ---
df = pd.read_csv("datasets/Dataset salary 2024.csv")
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df_cleaned = df.dropna(subset=['Age', 'salary_in_usd']).copy()
X = df_cleaned[['Age', 'salary_in_usd']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
df_cleaned['Cluster'] = cluster_labels
original_centroids = scaler.inverse_transform(kmeans.cluster_centers_)
# Create DataFrame for Centroids
centroids_df = pd.DataFrame(
    original_centroids,
    columns=['Age', 'salary_in_usd']
)
plt.figure(figsize=(10, 7))
sns.scatterplot(
    x='Age',
    y='salary_in_usd',
    hue='Cluster',
    data=df_cleaned,
    palette='viridis',
    s=30,           # Smaller size for data points
    legend='full',
    alpha=0.6
)
# 2. Scatter plot of the centroids
# Plotted in a single distinct color (red) with a large star marker so they stand out from the data points
sns.scatterplot(
    x='Age',
    y='salary_in_usd',
    data=centroids_df,
    marker='*',
    s=300,              # Large size for centroids
    color='red',
    edgecolor='black',
    label='Centroids',
    zorder=5            # Ensure centroids are plotted on top of the data points
)
# Customize plot
plt.title(f'K-means Clustering of Salary vs. Age with Centroids ($k={optimal_k}$)')
plt.xlabel('Age')
plt.ylabel('Salary in USD')
# Format y-axis to be more readable for salary
plt.ticklabel_format(style='plain', axis='y')
plt.gca().get_yaxis().set_major_formatter(
    plt.FuncFormatter(lambda x, p: format(int(x), ','))
)
plt.grid(True, linestyle='--', alpha=0.7)
# Add centroid coordinates to the plot for clarity
for i in range(len(centroids_df)):
    plt.text(
        centroids_df.iloc[i]['Age'] + 1,
        centroids_df.iloc[i]['salary_in_usd'],
        f'C{i}',
        fontsize=12,
        weight='bold',
        color='black'
    )
plt.show()
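To read the clusters numerically (my own addition, assuming the column names used above), the size and mean Age / salary of each cluster can be printed alongside the recovered centroids.
# Extra summary (my addition): size and mean Age / salary of each cluster
summary = df_cleaned.groupby('Cluster')[['Age', 'salary_in_usd']].agg(['count', 'mean'])
print(summary.round(1))
# Centroids recovered from KMeans, converted back to original units, for comparison
print(centroids_df.round(1))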
Analysis¶
The centroids clearly show the central tendency for the three distinct groups:
- C0 (Cluster 0): High Salary / Moderate Age (Approx. 247k at 51 years)
- C1 (Cluster 1): Moderate Salary / Young Age (Approx. 126k at 38 years)
- C2 (Cluster 2): Moderate Salary / High Age (Approx. 124k at 65 years)