[Wangd Lhamo] - Fab Futures - Data Science

Density Estimation

Density estimation approximates the probability density function (PDF) of a dataset. It shows how the data is distributed: where values are concentrated, how spread out they are, and whether the distribution is normal, skewed, or multimodal.

✅ Two Main Types of Density Estimation

  1. Parametric Density Estimation: You assume the data follows a known distribution (like Normal/Gaussian) and estimate its parameters. Example: fit a normal distribution by estimating its mean (μ) and standard deviation (σ).

  2. Non-Parametric Density Estimation: You do not assume a distribution shape. The most common method is Kernel Density Estimation (KDE), which smooths the data to create a continuous density curve.
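For the parametric case, here is a minimal sketch. It assumes the data really is roughly normal and uses `scipy.stats.norm.fit`, which returns maximum-likelihood estimates of μ and σ. The sample, the seed, and the variable names are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sample (assumption): 200 draws from Normal(mean=50, sd=10)
rng = np.random.default_rng(0)
sample = rng.normal(50, 10, 200)

# Parametric estimation: assume a normal shape, then estimate its two parameters
mu_hat, sigma_hat = norm.fit(sample)

# The fitted PDF can be evaluated at any point
grid = np.linspace(10, 90, 5)
fitted_pdf = norm.pdf(grid, loc=mu_hat, scale=sigma_hat)

print(f"estimated mean = {mu_hat:.2f}, estimated sd = {sigma_hat:.2f}")
```

If the data is skewed or multimodal, two parameters cannot describe it well, which is where the non-parametric approach becomes useful.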

In [2]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# sample data
data = np.random.normal(50, 10, 200)

sns.kdeplot(data)
plt.title("Kernel Density Estimation")
plt.show()
[Figure: KDE curve of the sample data, titled "Kernel Density Estimation"]
In [3]:
import numpy as np
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt

# data
data = np.random.normal(0, 1, 300)[:, None]

# KDE model
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(data)

# grid for evaluation
x = np.linspace(-4, 4, 1000)[:, None]
log_density = kde.score_samples(x)
density = np.exp(log_density)

plt.plot(x, density)
plt.title("Density Estimation using KDE")
plt.show()
[Figure: KDE curve titled "Density Estimation using KDE"]
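To make the smoothing step concrete, the following sketch computes a Gaussian KDE by hand with NumPy. It is an illustration of the standard formula f(x) = 1/(n·h) · Σ K((x − xᵢ)/h), not the internals of any library. The function name `gaussian_kde_manual`, the seed, and the two bandwidth values are assumptions chosen for the example.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE at each grid point.

    Implements f(x) = 1 / (n * h) * sum_i K((x - x_i) / h),
    where K is the standard normal PDF and h is the bandwidth.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise scaled distances, shape (len(grid), len(samples))
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Illustrative data (assumption): 300 draws from a standard normal
rng = np.random.default_rng(1)
samples = rng.normal(0, 1, 300)
grid = np.linspace(-6, 6, 1201)

narrow = gaussian_kde_manual(samples, grid, bandwidth=0.1)  # wiggly curve
wide = gaussian_kde_manual(samples, grid, bandwidth=0.5)    # smoother curve

# Both curves are densities, so the area under each is approximately 1
dx = grid[1] - grid[0]
print(narrow.sum() * dx, wide.sum() * dx)
```

A smaller bandwidth follows individual data points more closely, while a larger one smooths them out; the `bandwidth=0.5` used above is a choice, not a fixed rule.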

Mean Marks

Find the mean marks for the same grade across three class sections.

In [5]:
import pandas as pd
datasets = pd.read_excel("datasets/Cl IVABC ICT result Analysis Term 1 2025.xlsx")
datasets
Out[5]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9
0 NaN NaN NaN NaN Class IVABC Midterm 2025 NaN NaN NaN NaN
1 NaN Class Section Total Students Total Stds Passed Total Stds Fail Total Pass % Total Fail % Mean Mark National Mean Mark
2 NaN IV A 28 22 6 78.571429 21.428571 75.7 67.09
3 NaN IV B 28 24 5 85.714286 17.857143 71.7 75
4 NaN IV C 28 13 15 46.428571 53.571429 57.9 67.09
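The loaded frame has `Unnamed` column names because the real header sits in row 1 of the data. One way to tidy it, sketched under assumptions, is to promote that row to the header and drop the empty column. The `raw` frame below is a stand-in typed from the output shown above so the example runs without the Excel file; the exact position of the title cell in row 0 is a guess and does not affect the result.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded frame (assumption: typed from the displayed output)
raw = pd.DataFrame(
    [
        [np.nan] * 4 + ["Class IVABC Midterm 2025"] + [np.nan] * 5,
        [np.nan, "Class", "Section", "Total Students", "Total Stds Passed",
         "Total Stds Fail", "Total Pass %", "Total Fail %", "Mean Mark",
         "National Mean Mark"],
        [np.nan, "IV", "A", 28, 22, 6, 78.571429, 21.428571, 75.7, 67.09],
        [np.nan, "IV", "B", 28, 24, 5, 85.714286, 17.857143, 71.7, 75],
        [np.nan, "IV", "C", 28, 13, 15, 46.428571, 53.571429, 57.9, 67.09],
    ],
    columns=[f"Unnamed: {i}" for i in range(10)],
)

# Drop the empty first column, then use row 1 as the column names
clean = raw.drop(columns="Unnamed: 0")
clean.columns = clean.iloc[1].tolist()

# Keep only the data rows (everything after the header row)
clean = clean.iloc[2:].reset_index(drop=True)

# Convert the numeric columns from object dtype to float
numeric_cols = clean.columns.drop(["Class", "Section"])
clean = clean.astype({col: float for col in numeric_cols})

print(clean)
```

Depending on the sheet layout, passing a suitable `header=` argument to `pd.read_excel` may achieve the same result at load time.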
In [6]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Create the dataset
data = {
    "Class": ["IV", "IV", "IV"],
    "Section": ["A", "B", "C"],
    "Total Students": [28, 28, 28],
    "Total Stds Passed": [22, 24, 13],
    "Total Stds Fail": [6, 5, 15],
    "Total Pass %": [78.6, 85.7, 46.4],
    "Total Fail %": [21.4, 17.9, 53.6],
    "Mean Mark": [75.7, 71.7, 57.9],
    "National Mean Mark": [67.09, 75, 67.09]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Plot Pass % vs Sections
plt.figure(figsize=(8,5))
plt.bar(df['Section'], df['Total Pass %'], color='green', alpha=0.7, label='Pass %')
plt.bar(df['Section'], df['Total Fail %'], bottom=df['Total Pass %'], color='red', alpha=0.7, label='Fail %')
plt.ylabel('Percentage')
plt.title('Pass vs Fail Percentage by Section')
plt.legend()
plt.show()

# Compare Mean Marks with National Mean
plt.figure(figsize=(8,5))
plt.plot(df['Section'], df['Mean Mark'], marker='o', label='Class Mean Mark')
plt.plot(df['Section'], df['National Mean Mark'], marker='x', linestyle='--', label='National Mean Mark')
plt.ylabel('Marks')
plt.title('Class Mean vs National Mean')
plt.legend()
plt.show()
  Class Section  Total Students  Total Stds Passed  Total Stds Fail  \
0    IV       A              28                 22                6   
1    IV       B              28                 24                5   
2    IV       C              28                 13               15   

   Total Pass %  Total Fail %  Mean Mark  National Mean Mark  
0          78.6          21.4       75.7               67.09  
1          85.7          17.9       71.7               75.00  
2          46.4          53.6       57.9               67.09  
[Figure: stacked bar chart titled "Pass vs Fail Percentage by Section"]
[Figure: line chart titled "Class Mean vs National Mean"]

Section A is performing above the national average, and Section C needs improvement. Both conclusions follow directly from this data.
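These conclusions can also be checked in code rather than read off the chart. The sketch below re-enters the relevant columns from the table above and adds two derived columns; the column names `Counts Consistent` and `Gap vs National` are hypothetical names introduced for this example.

```python
import pandas as pd

# Values copied from the table above
df = pd.DataFrame({
    "Section": ["A", "B", "C"],
    "Total Students": [28, 28, 28],
    "Total Stds Passed": [22, 24, 13],
    "Total Stds Fail": [6, 5, 15],
    "Mean Mark": [75.7, 71.7, 57.9],
    "National Mean Mark": [67.09, 75, 67.09],
})

# Passed + failed should equal the total number of students
df["Counts Consistent"] = (
    df["Total Stds Passed"] + df["Total Stds Fail"]
) == df["Total Students"]

# Positive gap: the section is above the national mean listed for it
df["Gap vs National"] = (df["Mean Mark"] - df["National Mean Mark"]).round(2)

print(df[["Section", "Counts Consistent", "Gap vs National"]])
```

With these numbers, Section B's passed and failed counts add up to 29 rather than 28, so one of the source figures is probably mistyped and worth checking in the spreadsheet. Section B also sits below the national mean listed for it.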

In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [12]:
# Example student marks
marks = [75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82]

# Convert to a pandas Series (optional)
marks_series = pd.Series(marks)

Seaborn

Plot the density of the marks with Seaborn's kdeplot.

In [17]:
sns.kdeplot(marks_series, fill=True, color="skyblue")
plt.title("Density Plot of Students' Marks")
plt.xlabel("Marks")
plt.ylabel("Density")
plt.show()
[Figure: filled KDE curve titled "Density Plot of Students' Marks"]

Explain in detail

  1. marks_series.plot(kind='density', color='green'): marks_series is your data of students' marks, usually a Pandas Series. plot() is a Pandas method that can create various types of plots. kind='density' tells Pandas to create a Kernel Density Estimate (KDE) plot, a smooth curve that shows how marks are distributed. Think of it as a smoothed histogram: peaks in the curve indicate marks where more students scored. color='green' sets the curve color to green.

  2. plt.title("Density Plot of Students' Marks"): plt.title() comes from Matplotlib, which Pandas uses internally for plotting. It sets the title displayed above the graph.

  3. plt.xlabel("Marks"): Labels the x-axis of the plot. Here, the x-axis represents the students' marks (e.g., 0–100).

  4. plt.show(): Displays the plot in the Jupyter Notebook. Without this line, the plot may not render properly, especially in scripts.
In [15]:
marks_series.plot(kind='density', color='green')
plt.title("Density Plot of Students' Marks")
plt.xlabel("Marks")
plt.show()
[Figure: green KDE curve titled "Density Plot of Students' Marks"]
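To see the numbers behind the curve, the density can also be computed directly. The sketch below uses `scipy.stats.gaussian_kde`, which pandas relies on for `kind='density'`, with its default bandwidth. The grid range and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Same example marks as in the notebook
marks = np.array([75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82], dtype=float)

# Fit a Gaussian KDE with the default bandwidth rule
kde = gaussian_kde(marks)

# Evaluate the density on a grid wider than the data range (assumed range)
grid = np.linspace(30, 125, 951)
density = kde(grid)

# The peak marks the region where scores are most concentrated
peak_mark = grid[np.argmax(density)]

# A density integrates to about 1 over its whole range
area = density.sum() * (grid[1] - grid[0])

print(f"density peaks near {peak_mark:.1f}; area under curve ≈ {area:.3f}")
```

Note that the y-axis values are densities, not student counts: only the area under the curve over a range of marks can be read as a proportion of students.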

Overlay histogram

This shows both the histogram and the smooth density curve together.

In [16]:
sns.histplot(marks_series, kde=True, bins=10, color="orange")
plt.title("Histogram with Density of Students' Marks")
plt.xlabel("Marks")
plt.ylabel("Count / Density")
plt.show()
[Figure: orange histogram with KDE overlay titled "Histogram with Density of Students' Marks"]
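The y-axis label above mentions both count and density, which are different scales; by default `sns.histplot` draws counts. As a rough sketch using the same marks, a count histogram converts to a density histogram by dividing each bar by (number of marks × bin width), which is what `np.histogram` does with `density=True` when the bins have equal width.

```python
import numpy as np

# Same example marks as in the notebook
marks = np.array([75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82])

# Count scale: how many marks fall in each of 10 equal-width bins
counts, edges = np.histogram(marks, bins=10)

# Density scale: bars rescaled so that their total area is 1
density, _ = np.histogram(marks, bins=10, density=True)

# Manual conversion from counts to density
bin_width = edges[1] - edges[0]
manual_density = counts / (counts.sum() * bin_width)

print(counts.sum())                          # total number of marks
print(np.allclose(density, manual_density))  # whether the two scalings agree
print((density * bin_width).sum())           # total area of the density bars
```

Keeping this distinction in mind matters when overlaying several histograms and a KDE curve on one axis: they only line up if they share the same scale and the same bin edges.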

Histogram with density

Plot the students' marks and highlight the passing and failing marks, assuming a pass mark of 60.

In [18]:
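import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example student marks
marks = [75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82, 55, 50]

marks_series = pd.Series(marks)

# Define pass mark
pass_mark = 60

# Shared bin edges so all histograms line up; 60 is an edge, so no bin mixes pass and fail
bins = np.arange(50, 101, 5)

# Base histogram on the density scale, so it matches the density curve
sns.histplot(marks_series, bins=bins, stat="density", alpha=0.6, color="lightgrey")

# Split the marks into passing and failing groups
passed = marks_series[marks_series >= pass_mark]
failed = marks_series[marks_series < pass_mark]

# Weight each mark by 1 / (n * bin width) so both groups use the base histogram's density scale
scale = 1 / (len(marks_series) * (bins[1] - bins[0]))

# Highlight passing marks
plt.hist(passed, bins=bins, weights=np.full(len(passed), scale), alpha=0.7, color='green', label='Pass')

# Highlight failing marks
plt.hist(failed, bins=bins, weights=np.full(len(failed), scale), alpha=0.7, color='red', label='Fail')

# Add density line
sns.kdeplot(marks_series, color='blue', linewidth=2)

plt.title("Histogram with Density and Pass/Fail Highlights")
plt.xlabel("Marks")
plt.ylabel("Density")
plt.legend()
plt.show()
[Figure: histogram titled "Histogram with Density and Pass/Fail Highlights", with green pass bars, red fail bars, and a blue density curve]

Explanation

Base histogram (lightgrey) → Shows the overall marks distribution on the density scale (stat="density").

Passing marks (green) → Highlights marks ≥ pass mark.

Failing marks (red) → Highlights marks < pass mark.

Density curve (blue) → Smooth curve showing the distribution.

bins → Shared bin edges, so all histograms line up and the pass mark falls on a bin edge.

weights → Scales the pass and fail histograms to the same density scale as the base histogram and the density curve. Using density=True on each group separately would normalize each group to area 1 and distort their relative sizes.

alpha → Transparency, so colors don't completely block each other.

plt.legend() → Adds a legend for Pass/Fail.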