Density Estimation¶
Density estimation estimates the probability density function (PDF) of a dataset. It helps you understand how the data is distributed: where values are concentrated, how spread out they are, and whether the distribution is normal, skewed, multimodal, and so on.
✅ Two Main Types of Density Estimation
Parametric Density Estimation
You assume the data follows a known distribution (such as Normal/Gaussian) and estimate its parameters.
Example: Fit a normal distribution
- Estimate mean (μ)
- Estimate standard deviation (σ)
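The parametric approach can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the sample data, the fixed seed, and the helper name normal_pdf are assumptions made for the example. It estimates μ and σ from the sample and plugs them into the normal density formula.

```python
import numpy as np

# Hypothetical sample: 200 draws from Normal(50, 10); the seed is only for reproducibility
rng = np.random.default_rng(0)
data = rng.normal(50, 10, 200)

# Estimate the two parameters of the assumed normal distribution
mu_hat = data.mean()
sigma_hat = data.std(ddof=0)  # maximum-likelihood estimate of sigma

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Evaluate the fitted density on a grid covering about 5 standard deviations each side
x = np.linspace(mu_hat - 5 * sigma_hat, mu_hat + 5 * sigma_hat, 1001)
density = normal_pdf(x, mu_hat, sigma_hat)

# A valid density integrates to roughly 1 (simple Riemann-sum check)
area = float(density.sum() * (x[1] - x[0]))
print(f"mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}, area = {area:.4f}")
```

The whole curve is described by just two numbers, which is why this approach only works well when the normality assumption is reasonable for the data.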
Non-Parametric Density Estimation
You do not assume a distribution shape. The most common method is Kernel Density Estimation (KDE), which smooths the data to create a continuous density curve.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# sample data
data = np.random.normal(50, 10, 200)
sns.kdeplot(data)
plt.title("Kernel Density Estimation")
plt.show()
import numpy as np
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
# data
data = np.random.normal(0, 1, 300)[:, None]
# KDE model
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(data)
# grid for evaluation
x = np.linspace(-4, 4, 1000)[:, None]
log_density = kde.score_samples(x)
density = np.exp(log_density)
plt.plot(x, density)
plt.title("Density Estimation using KDE")
plt.show()
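To make the smoothing step less of a black box, the following sketch computes a Gaussian KDE by hand: one small Gaussian bump is centred on each data point, and the bumps are averaged. The function name gaussian_kde_1d, the seed, and the grid are assumptions chosen for illustration. With the same data and bandwidth, the result should closely match what KernelDensity returns.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    # Standardised distance between every grid point and every sample:
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical sample comparable to the one above (seed fixed for reproducibility)
rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 300)

grid = np.linspace(-6, 6, 1201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# The estimate should be non-negative and integrate to roughly 1
area = float(density.sum() * (grid[1] - grid[0]))
print(f"area under the KDE curve = {area:.4f}")
```

A smaller bandwidth makes each bump narrower and the curve wigglier; a larger one gives a smoother curve that may hide real structure.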
Mean Marks¶
To find the mean marks for the same grade (Class IV) across its three sections (A, B, and C):
import pandas as pd
datasets = pd.read_excel("datasets/Cl IVABC ICT result Analysis Term 1 2025.xlsx")
datasets
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | Class IVABC | Midterm 2025 | NaN | NaN | NaN | NaN |
| 1 | NaN | Class | Section | Total Students | Total Stds Passed | Total Stds Fail | Total Pass % | Total Fail % | Mean Mark | National Mean Mark |
| 2 | NaN | IV | A | 28 | 22 | 6 | 78.571429 | 21.428571 | 75.7 | 67.09 |
| 3 | NaN | IV | B | 28 | 24 | 5 | 85.714286 | 17.857143 | 71.7 | 75 |
| 4 | NaN | IV | C | 28 | 13 | 15 | 46.428571 | 53.571429 | 57.9 | 67.09 |
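The "Unnamed" columns appear because the real header sits in the second row of the sheet. One possible clean-up is sketched below. Since the Excel file itself is not available here, the raw frame is a stand-in rebuilt from the output shown above, and the names raw and clean are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by read_excel (values copied from the output above)
raw = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan, np.nan, "Class IVABC", "Midterm 2025",
         np.nan, np.nan, np.nan, np.nan],
        [np.nan, "Class", "Section", "Total Students", "Total Stds Passed",
         "Total Stds Fail", "Total Pass %", "Total Fail %", "Mean Mark",
         "National Mean Mark"],
        [np.nan, "IV", "A", 28, 22, 6, 78.571429, 21.428571, 75.7, 67.09],
        [np.nan, "IV", "B", 28, 24, 5, 85.714286, 17.857143, 71.7, 75],
        [np.nan, "IV", "C", 28, 13, 15, 46.428571, 53.571429, 57.9, 67.09],
    ],
    columns=[f"Unnamed: {i}" for i in range(10)],
)

# Row 1 holds the real column names; the first column is empty padding
header = raw.iloc[1, 1:].tolist()

# Keep only the data rows and real columns, then apply the header
clean = raw.iloc[2:, 1:].copy()
clean.columns = header
clean = clean.reset_index(drop=True)

# Everything after "Class" and "Section" is numeric
numeric_cols = header[2:]
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric)

print(clean)
```

Depending on the sheet layout, passing header= and usecols= to pd.read_excel may achieve the same result at read time.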
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Create the dataset
data = {
"Class": ["IV", "IV", "IV"],
"Section": ["A", "B", "C"],
"Total Students": [28, 28, 28],
"Total Stds Passed": [22, 24, 13],
"Total Stds Fail": [6, 5, 15],
"Total Pass %": [78.6, 85.7, 46.4],
"Total Fail %": [21.4, 17.9, 53.6],
"Mean Mark": [75.7, 71.7, 57.9],
"National Mean Mark": [67.09, 75, 67.09]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
# Plot Pass % vs Sections
plt.figure(figsize=(8,5))
plt.bar(df['Section'], df['Total Pass %'], color='green', alpha=0.7, label='Pass %')
plt.bar(df['Section'], df['Total Fail %'], bottom=df['Total Pass %'], color='red', alpha=0.7, label='Fail %')
plt.ylabel('Percentage')
plt.title('Pass vs Fail Percentage by Section')
plt.legend()
plt.show()
# Compare Mean Marks with National Mean
plt.figure(figsize=(8,5))
plt.plot(df['Section'], df['Mean Mark'], marker='o', label='Class Mean Mark')
plt.plot(df['Section'], df['National Mean Mark'], marker='x', linestyle='--', label='National Mean Mark')
plt.ylabel('Marks')
plt.title('Class Mean vs National Mean')
plt.legend()
plt.show()
  Class Section  Total Students  Total Stds Passed  Total Stds Fail  \
0    IV       A              28                 22                6
1    IV       B              28                 24                5
2    IV       C              28                 13               15

   Total Pass %  Total Fail %  Mean Mark  National Mean Mark
0          78.6          21.4       75.7               67.09
1          85.7          17.9       71.7               75.00
2          46.4          53.6       57.9               67.09
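Conclusions such as "Section A is performing above the national average, Section C needs improvement" can all be drawn from this data: Section A's mean mark (75.7) is above its national mean (67.09), while Section C's (57.9) is well below it.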
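The comparison can also be computed rather than read off the chart. The sketch below rebuilds only the columns it needs from the table above; the column names "Gap vs National" and "Above National" are assumptions introduced for the example. It also computes one overall mean for the three sections, weighted by the number of students.

```python
import pandas as pd

# Only the columns needed for the comparison (values from the table above)
df = pd.DataFrame({
    "Section": ["A", "B", "C"],
    "Total Students": [28, 28, 28],
    "Mean Mark": [75.7, 71.7, 57.9],
    "National Mean Mark": [67.09, 75, 67.09],
})

# Positive gap: the section is above the national mean; negative: below it
df["Gap vs National"] = (df["Mean Mark"] - df["National Mean Mark"]).round(2)
df["Above National"] = df["Gap vs National"] > 0

# Overall mean across sections, weighted by the number of students in each
overall_mean = (df["Mean Mark"] * df["Total Students"]).sum() / df["Total Students"].sum()

print(df[["Section", "Mean Mark", "National Mean Mark", "Gap vs National", "Above National"]])
print(f"Overall mean mark: {overall_mean:.2f}")
```

Because all three sections have 28 students, the weighted mean equals the simple average of the three section means here; the weighting matters only when section sizes differ.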
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Example student marks
marks = [75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82]
# Convert to a pandas Series (optional)
marks_series = pd.Series(marks)
Seaborn¶
Plot density
sns.kdeplot(marks_series, fill=True, color="skyblue")
plt.title("Density Plot of Students' Marks")
plt.xlabel("Marks")
plt.ylabel("Density")
plt.show()
Explain in detail¶
1️⃣ marks_series.plot(kind='density', color='green')
- marks_series is the students' marks data, stored as a Pandas Series.
- plot() is a Pandas method that can create various types of plots.
- kind='density' tells Pandas to create a Kernel Density Estimate (KDE) plot, a smooth curve that shows how the marks are distributed. Think of it as a smoothed histogram. Peaks in the curve indicate marks where more students scored. Pandas relies on SciPy to compute this curve, so SciPy must be installed.
- color='green' sets the curve color to green.
2️⃣ plt.title("Density Plot of Students' Marks")
- plt.title() comes from Matplotlib, which Pandas uses internally for plotting.
- It sets the title displayed above the graph.
3️⃣ plt.xlabel("Marks")
- Labels the x-axis of the plot. Here, the x-axis represents the students' marks (e.g., 0–100).
4️⃣ plt.show()
- Displays the plot. In a Jupyter Notebook the plot usually renders without it, but in a standalone script the figure typically does not appear unless plt.show() is called.
marks_series.plot(kind='density', color='green')
plt.title("Density Plot of Students' Marks")
plt.xlabel("Marks")
plt.show()
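To see where the peak of such a curve falls, the density can be evaluated numerically. The sketch below is an approximation under stated assumptions: it uses Scott's rule for the bandwidth (sample standard deviation times n^(-1/5)), which is the default Pandas inherits from SciPy, and it reimplements the Gaussian KDE by hand instead of calling SciPy. The variable names are chosen for the example.

```python
import numpy as np

marks = np.array([75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82], dtype=float)

# Scott's rule for 1-D data: bandwidth = sample std * n^(-1/5)
n = len(marks)
bandwidth = marks.std(ddof=1) * n ** (-1 / 5)

# Evaluate the KDE on a grid extending a few bandwidths beyond the data
grid = np.linspace(marks.min() - 5 * bandwidth, marks.max() + 5 * bandwidth, 2001)
z = (grid[:, None] - marks[None, :]) / bandwidth
density = (np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (n * bandwidth)

# The highest point of the curve is the mark around which scores cluster most
peak_mark = float(grid[np.argmax(density)])
area = float(density.sum() * (grid[1] - grid[0]))

print(f"bandwidth = {bandwidth:.2f}")
print(f"density peaks near a mark of {peak_mark:.1f}")
```

Note that the curve extends slightly beyond the lowest and highest marks. This is a smoothing effect, not evidence that any student scored there.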
Overlay histogram¶
This shows both the histogram and the smooth density curve together.
sns.histplot(marks_series, kde=True, bins=10, color="orange")
plt.title("Histogram with Density of Students' Marks")
plt.xlabel("Marks")
plt.ylabel("Count")  # histplot defaults to counts; the KDE curve is scaled to match
plt.show()
Histogram with density¶
Plot the students' marks and highlight passing and failing marks. Assume a pass mark of 60.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Example student marks
marks = [75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82, 55, 50]
marks_series = pd.Series(marks)
# Define pass mark
pass_mark = 60
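# Split marks into passing and failing groups
passed = [m for m in marks if m >= pass_mark]
failed = [m for m in marks if m < pass_mark]
# Use the same 10 bins over the full range of marks for every histogram
n_bins = 10
mark_range = (min(marks), max(marks))
bin_width = (mark_range[1] - mark_range[0]) / n_bins
# Weight each mark so the pass and fail bars together add up to the overall density
weight = 1 / (len(marks) * bin_width)
# Plot the base histogram on the density scale (not counts)
sns.histplot(marks_series, bins=n_bins, binrange=mark_range, stat="density", alpha=0.6, color="lightgrey")
# Highlight passing marks
plt.hist(passed, bins=n_bins, range=mark_range, weights=[weight] * len(passed), alpha=0.7, color='green', label='Pass')
# Highlight failing marks
plt.hist(failed, bins=n_bins, range=mark_range, weights=[weight] * len(failed), alpha=0.7, color='red', label='Fail')
# Add density line (drawn once here, so kde=True is not needed in histplot)
sns.kdeplot(marks_series, color='blue', linewidth=2)
plt.title("Histogram with Density and Pass/Fail Highlights")
plt.xlabel("Marks")
plt.ylabel("Density")
plt.legend()
plt.show()
Explanation¶
Base histogram (lightgrey) → Shows overall marks distribution on the density scale (stat="density").
Passing marks (green) → Highlights marks ≥ pass mark.
Failing marks (red) → Highlights marks < pass mark.
Density curve (blue) → Smooth curve showing distribution.
Shared bins (bins with range / binrange) → All three histograms use the same bin edges, so the bars line up.
weights → Scales each pass/fail bar by 1 / (number of marks × bin width), so the green and red bars together add up to the overall density and sit on the same scale as the density curve.
alpha → Transparency, so colors don’t completely block each other.
plt.legend() → Adds a legend for Pass/Fail.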
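The pass/fail split shown in the plot can also be summarised as numbers. A minimal sketch using the same marks and the assumed pass mark of 60:

```python
marks = [75, 80, 68, 90, 85, 77, 92, 60, 73, 88, 95, 70, 82, 55, 50]
pass_mark = 60  # assumed pass mark, as above

# A mark equal to the pass mark counts as a pass
passed = [m for m in marks if m >= pass_mark]
failed = [m for m in marks if m < pass_mark]

pass_rate = 100 * len(passed) / len(marks)
print(f"Passed: {len(passed)}, Failed: {len(failed)}, Pass rate: {pass_rate:.1f}%")
# prints: Passed: 13, Failed: 2, Pass rate: 86.7%
```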