Assignment
Density Estimation
I learned about density estimation from YouTube and asked Rico some questions.
My Understanding of Density Estimation: Kernel Density Estimation (KDE) is a core data science tool for estimating a population's probability density from a sample, helping analysts understand patterns and make predictions. Instead of relying only on histograms, KDE places a smooth curve (often a Gaussian kernel) around each data point and averages these curves to infer where values are most likely, even at points that were never observed. Points close to a location contribute more to its estimated density than distant ones. Choosing the bandwidth (h), which controls how wide each kernel is, is crucial: too small makes the estimate noisy, and too large smooths away important structure. Because the true population distribution is unknown, finding the ideal bandwidth is challenging, which leads to practical approaches such as rules of thumb, plug-in methods, cross-validation, and visual inspection. Overall, KDE provides a flexible, intuitive way to visualise and understand the underlying shape of data distributions.
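To make this concrete, here is a minimal NumPy sketch of a 1D Gaussian KDE that I wrote as an illustration (the function name gaussian_kde_1d, the sample values, and the bandwidth h=2.0 are my own choices, not from the lecture). It places a Gaussian of width h at every sample point and averages them.

import numpy as np

def gaussian_kde_1d(samples, grid, h=1.0):
    # density(x) = (1 / (n*h)) * sum_i K((x - x_i) / h), with K the standard normal pdf
    diffs = (grid[:, None] - samples[None, :]) / h          # shape (len(grid), n)
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)  # Gaussian kernel values
    return kernels.mean(axis=1) / h                         # average kernels, rescale by h

# Example: estimate the density of a small synthetic sample on a grid of x-values
samples = np.random.normal(loc=20, scale=5, size=200)
grid = np.linspace(samples.min() - 5, samples.max() + 5, 200)
density = gaussian_kde_1d(samples, grid, h=2.0)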
ChatGPT prompt: Why use small data for estimation when you have big data?
Speed and Efficiency: Working with millions or billions of rows can be slow and costly. A small, well-chosen sample can give you nearly the same insights much faster.
Exploration and Prototyping: Data scientists often explore patterns, build quick models, or test ideas on a small subset first. It helps them iterate rapidly before scaling up.
Memory and Computation Limits: Some algorithms (like certain density estimators, clustering methods, or simulations) become extremely slow with huge datasets. Sampling solves this.
Noise Reduction: More data isn't always better. Massive datasets often contain duplicates, errors, or highly correlated data points. A smaller cleaned sample can represent the true underlying distribution just as well.
Statistical Sufficiency: Many statistical patterns (means, distributions, correlations) stabilise with surprisingly little data. After a point, adding more data changes nothing but increases cost.
Streaming or Real-Time Settings: Sometimes data arrives continuously (e.g., sensors, clicks). You can't store everything, so you estimate from a sampled window.
Privacy and Governance: Large datasets might include sensitive information. Using a small anonymised sample reduces risk and is often required by policy.
Visualisation Limits: Plots (like KDEs) become unreadable with millions of points. Sampling produces clearer visualisations.
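As a quick sanity check of the "statistical sufficiency" point above, here is a small sketch of my own (with made-up synthetic data, not the climate dataset) showing that a modest random sample reproduces the summary statistics of a much larger dataset almost exactly.

import numpy as np

rng = np.random.default_rng(0)
big = rng.normal(loc=15, scale=4, size=1_000_000)    # pretend this is "big data"
sample = rng.choice(big, size=5_000, replace=False)  # small random sample

# The sample's mean and standard deviation come out very close to the full data's
print(big.mean(), big.std())
print(sample.mean(), sample.std())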
Clustering
This is what I understood from the code shared by Professor Neil:
- Generate fake data that forms 3 clusters
The code first creates 3 groups of points. Each group has 1000 points, drawn from a normal distribution centered at:
(0, 0)
(5, 10)
(10, 5)
So the data looks like three “clouds” of points scattered around these centers.
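A rough sketch of what that data generation might look like (the centers and the 1000 points per group match the description above; the random seed and the spread scale=1.0 are my own assumptions, since I don't have the original code):

import numpy as np

rng = np.random.default_rng(42)
centers = [(0, 0), (5, 10), (10, 5)]
# 1000 points per cluster, normally distributed around each center
data = np.vstack([rng.normal(loc=c, scale=1.0, size=(1000, 2)) for c in centers])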
- Define a custom K-means algorithm
The function kmeans() does the following:
Randomly pick starting centers for the clusters.
For a fixed number of steps (1000):
Compute the distance from every point to every cluster center.
Assign each point to the closest cluster.
Recalculate each cluster center as the average of the points assigned to it.
When finished, it computes the total distance from all points to their assigned cluster centers.
This is the basic K-means clustering process.
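Based on that description, a minimal 2D K-means sketch could look like the following. This is my reconstruction, not the original kmeans() function; in particular, initializing the centers from randomly chosen data points is an assumption.

import numpy as np

def kmeans(data, nclusters=3, nsteps=1000):
    # Randomly pick starting centers from the data points
    rng = np.random.default_rng(0)
    centers = data[rng.choice(len(data), size=nclusters, replace=False)].astype(float)
    for _ in range(nsteps):
        # Distance from every point to every center, shape (npoints, nclusters)
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its closest center
        labels = np.argmin(distances, axis=1)
        # Move each center to the mean of the points assigned to it
        for i in range(nclusters):
            if np.any(labels == i):
                centers[i] = data[labels == i].mean(axis=0)
    # Total distance from all points to their assigned centers
    total_dist = np.sum(np.linalg.norm(data - centers[labels], axis=1))
    return centers, labels, total_dist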
- Plot the K-means results
There are two plotting functions:
- plot_kmeans()
- Plots data points and cluster centers as red dots.
- plot_Voronoi()
- Plots data points and also draws Voronoi regions, which visually show which cluster center owns which area.
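For the Voronoi part, one way to draw the regions is with SciPy's scipy.spatial.Voronoi and voronoi_plot_2d. The plot_voronoi function below is my own sketch of the idea, not the professor's plot_Voronoi; it takes the data, centers, and labels produced by the kmeans sketch above.

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

def plot_voronoi(data, centers, labels):
    # Draw the Voronoi region "owned" by each cluster center
    vor = Voronoi(centers)
    fig, ax = plt.subplots()
    voronoi_plot_2d(vor, ax=ax, show_points=False, show_vertices=False)
    # Overlay the data points (colored by cluster) and the centers in red
    ax.scatter(data[:, 0], data[:, 1], c=labels, cmap='tab10', s=5)
    ax.scatter(centers[:, 0], centers[:, 1], c='red', s=80)
    plt.show()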
- Run K-means with 1, 2, 3, 4, and 5 clusters
For each number of clusters:
Run the K-means algorithm
Store the total distance (how well the clusters fit)
Plot the clusters
For 1 and 2 clusters: simple plots
For 3–5 clusters: Voronoi diagrams
- Plot how the total distance changes with number of clusters
Finally, the code makes a plot where:
The x-axis = number of clusters (1 to 5)
The y-axis = total distance from points to their cluster centers
This helps you see how adding more clusters makes the clustering "tighter."
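Putting the pieces together, the loop over cluster counts and the final elbow-style plot would look roughly like this (reusing data and kmeans() from my sketches above, so again an approximation of the original code):

import matplotlib.pyplot as plt

totals = []
for k in [1, 2, 3, 4, 5]:
    centers, labels, total_dist = kmeans(data, nclusters=k)
    totals.append(total_dist)

plt.plot([1, 2, 3, 4, 5], totals, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Total distance to assigned centers')
plt.show()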
I tried on my own but couldn't fix the error, so I asked ChatGPT to fix it. Prompt: Fix my error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('datasets/climate.csv')

# Take only the 'avg_temperature' column
x = df['avg_temperature'].values

# Remove missing values
x = x[~np.isnan(x)]

# 1D K-means function
def kmeans_1d(x, nclusters=3, nsteps=100):
    # Random initial cluster centers
    centers = np.random.choice(x, size=nclusters, replace=False).astype(float)
    for _ in range(nsteps):
        # Assign points to nearest center
        distances = np.abs(np.outer(x, np.ones(nclusters)) - np.outer(np.ones(len(x)), centers))
        labels = np.argmin(distances, axis=1)
        # Update centers
        for i in range(nclusters):
            idx = np.where(labels == i)[0]
            if len(idx) > 0:
                centers[i] = np.mean(x[idx])
    # Compute total distance
    total_dist = np.sum(np.abs(x - centers[labels]))
    return centers, labels, total_dist

# Plot 1D clusters
def plot_clusters_1d(x, centers, labels, title=''):
    plt.figure(figsize=(8, 2))
    plt.scatter(x, np.zeros_like(x), c=labels, cmap='tab10', s=10)
    plt.scatter(centers, np.zeros_like(centers), c='red', s=50, label='Centers')
    for c in centers:
        plt.axvline(c, color='red', linestyle='--', alpha=0.5)
    plt.title(title)
    plt.xlabel('avg_temperature')
    plt.yticks([])
    plt.show()

# Run K-means for different numbers of clusters
results = []
for k in [1, 2, 3, 4, 5]:
    centers, labels, total_dist = kmeans_1d(x, nclusters=k, nsteps=200)
    results.append((k, total_dist))
    plot_clusters_1d(x, centers, labels, title=f'{k} clusters')

# Plot total distance vs number of clusters (Elbow method)
ks, dists = zip(*results)
plt.figure()
plt.plot(ks, dists, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Total distance')
plt.title('Elbow plot for avg_temperature')
plt.xticks(ks)
plt.show()
- Load your temperature data
Take only the column avg_temperature from the dataset.
Remove missing values so we have clean numbers to work with.
- Pick a few random starting points
These are guesses for where the "centers" of temperature clusters might be.
- Assign each temperature to the nearest center
Every temperature value is checked to see which center it’s closest to.
- Update the centers
For each cluster, calculate the average of all temperatures assigned to it.
Move the center to that average.
- Repeat steps 3 and 4 many times
Keep assigning points and updating centers until the centers stop moving much (or after a fixed number of steps).
- Plot the results
Show temperatures on a line.
Points are colored by cluster.
Cluster centers are shown as red dots and vertical lines.
- Check how well clustering worked
For each number of clusters (1–5), calculate the total distance of points from their cluster center, then plot these distances to see how adding more clusters improves the fit. The code groups your temperatures into clusters and shows visually where the main groups are, and the elbow plot tells you how many clusters make sense for your data.
Kernel Density Estimation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load your data
df = pd.read_csv('datasets/climate.csv')
x = df['avg_temperature'].values
x = x[~np.isnan(x)] # remove missing values
# Plot KDE
plt.figure(figsize=(8,4))
sns.kdeplot(x, bw_adjust=1, fill=True) # bw_adjust controls smoothness
plt.title('Kernel Density Estimation of avg_temperature')
plt.xlabel('avg_temperature')
plt.ylabel('Density')
plt.show()
This code visualises how temperatures are distributed in my dataset.
Instead of showing bars like a histogram, it draws a smooth curve showing where temperatures are more frequent.
Peaks in the curve: temperatures that occur more often.
Valleys: temperatures that occur less often.
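Since the bandwidth matters so much (see my density estimation notes above), a quick variation of the same plot can show its effect. The bw_adjust values below are arbitrary choices of my own for illustration.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('datasets/climate.csv')
x = df['avg_temperature'].dropna().values

plt.figure(figsize=(8, 4))
for bw in [0.3, 1, 3]:
    # Smaller bw_adjust gives a wigglier curve, larger gives a smoother one
    sns.kdeplot(x, bw_adjust=bw, label=f'bw_adjust={bw}')
plt.legend()
plt.title('Effect of bandwidth on the KDE of avg_temperature')
plt.show()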