Week 05: Session on Probability¶
The curriculum covers foundational probability concepts, starting with quantifying uncertainty through discrete and continuous probability distributions. It explores statistical inference via log-likelihood optimization (where Gaussian assumptions lead to least squares) and Bayesian priors acting as regularization terms. Key descriptive statistics (expectation, mean, variance, and standard deviation) are defined mathematically, followed by distribution theory covering long-tail patterns, multimodality, and the Gaussian/normal distribution, with the Central Limit Theorem explaining why sample means converge to normality. Practical applications include error reduction through averaging (averaging N independent samples shrinks the error by a factor of 1/√N), supported by Python simulations that demonstrate these statistical principles in action.
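As a quick illustration of the 1/√N claim, here is a minimal NumPy sketch; the true value, noise level, sample sizes, and seed are arbitrary choices for the demonstration, not part of the session's dataset:

import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (arbitrary choice)
true_value = 5.0
noise_std = 2.0  # standard deviation of each noisy measurement

# For each N, average N noisy samples 10,000 times and measure the spread
for n in [1, 4, 16, 64, 256]:
    samples = true_value + noise_std * rng.standard_normal((10_000, n))
    errors = samples.mean(axis=1) - true_value
    print(f"N={n:4d}  empirical error std: {errors.std():.3f}  "
          f"predicted sigma/sqrt(N): {noise_std / np.sqrt(n):.3f}")

The empirical standard deviation of the averaged error should track sigma/√N closely, which is the Central Limit Theorem at work.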
Distribution of data using probability¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('datasets/viii_2023.csv')

# Inspect the column names first
print("Original columns in the CSV:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())

# Clean the column names (remove extra spaces)
df.columns = [col.strip() if isinstance(col, str) else col for col in df.columns]
print("\nCleaned column names:")
print(df.columns.tolist())

# The first column holds the student names; rename it
df = df.rename(columns={df.columns[0]: 'Name'})

# The subject columns start at index 1; exclude 'Name' and the last
# column (which may be a Pass/Fail column)
subject_columns = df.columns[1:-1].tolist()
print("\nSubject columns found:")
print(subject_columns)

# Remove rows that are completely empty or contain summary information:
# keep only rows with actual student names
mask = df['Name'].notna() & ~df['Name'].astype(str).str.contains('S.M=', na=False)
df_students = df[mask].copy()
df_students = df_students.reset_index(drop=True)
print(f"\nFound {len(df_students)} student records")

# Convert score columns to numeric
for col in subject_columns:
    if col in df_students.columns:
        df_students[col] = pd.to_numeric(df_students[col], errors='coerce')

# Rename the last column if it is the Pass/Fail column; keep it for
# reference but exclude it from score calculations
if df_students.columns[-1] not in subject_columns and df_students.columns[-1] != 'Name':
    pass_fail_col = df_students.columns[-1]
    print(f"\nLast column appears to be: '{pass_fail_col}'")
    df_students = df_students.rename(columns={pass_fail_col: 'Result'})

print("\nFirst few student records:")
print(df_students.head())
print("\nData types:")
print(df_students.dtypes)

# Now create histograms for marks > 40
print("\n" + "=" * 60)
print("ANALYZING MARKS > 40 BY SUBJECT")
print("=" * 60)

# Create a figure with multiple subplots, three per row
num_subjects = len(subject_columns)
rows = (num_subjects + 2) // 3  # Rows needed for 3 columns
fig, axes = plt.subplots(rows, 3, figsize=(15, 4 * rows))
fig.suptitle('Distribution of Marks > 40 by Subject', fontsize=16, fontweight='bold')

# Flatten axes for easier iteration (handles both 1-D and 2-D layouts)
axes = np.atleast_1d(axes).flatten()

# Colors for each subject
colors = plt.cm.tab20(np.linspace(0, 1, num_subjects))

# Analyze each subject
summary_data = []
for idx, subject in enumerate(subject_columns):
    if idx >= len(axes):
        break
    ax = axes[idx]

    # Get marks > 40
    valid_marks = df_students[subject].dropna()
    marks_above_40 = valid_marks[valid_marks > 40]

    if len(marks_above_40) > 0:
        # Calculate statistics
        total_with_marks = len(valid_marks)
        students_above_40 = len(marks_above_40)
        percentage_above_40 = (students_above_40 / total_with_marks) * 100
        mean_above_40 = marks_above_40.mean()

        # Create histogram with 5-point bins starting from 40
        max_mark = marks_above_40.max()
        bins = np.arange(40, max_mark + 5, 5)
        n, bins, patches = ax.hist(marks_above_40, bins=bins,
                                   edgecolor='black', linewidth=1,
                                   alpha=0.7, color=colors[idx])

        # Add mean line
        ax.axvline(mean_above_40, color='red', linestyle='--',
                   linewidth=2, label=f'Mean: {mean_above_40:.1f}')

        # Customize plot
        ax.set_title(f'{subject}\n({students_above_40}/{total_with_marks} > 40)',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Number of Students')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=9)

        # Add count labels on top of bars
        for i in range(len(n)):
            if n[i] > 0:
                ax.text(bins[i] + 2.5, n[i] + 0.1, f'{int(n[i])}',
                        ha='center', va='bottom', fontsize=8)

        # Store summary
        summary_data.append({
            'Subject': subject,
            'Total Students': total_with_marks,
            'Students > 40': students_above_40,
            'Percentage > 40': percentage_above_40,
            'Mean (> 40)': mean_above_40,
            'Min (> 40)': marks_above_40.min(),
            'Max (> 40)': marks_above_40.max()
        })
    else:
        ax.text(0.5, 0.5, f'No marks > 40\nin {subject}',
                ha='center', va='center', transform=ax.transAxes, fontsize=12)
        ax.set_title(f'{subject}', fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Number of Students')
        summary_data.append({
            'Subject': subject,
            'Total Students': len(valid_marks),
            'Students > 40': 0,
            'Percentage > 40': 0,
            'Mean (> 40)': np.nan,
            'Min (> 40)': np.nan,
            'Max (> 40)': np.nan
        })

# Remove any unused subplots
for idx in range(len(subject_columns), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()

# Create summary DataFrame
summary_df = pd.DataFrame(summary_data)
print("\n" + "=" * 80)
print("SUMMARY STATISTICS FOR MARKS > 40")
print("=" * 80)
print(summary_df.to_string(index=False))

# Additional analysis
print("\n" + "=" * 80)
print("ADDITIONAL ANALYSIS")
print("=" * 80)

total_students = len(df_students)
print(f"Total number of students: {total_students}")

# Pass/Fail counts if a Result column is present
if 'Result' in df_students.columns:
    print(f"\nPass/Fail results available in 'Result' column")
    pass_count = (df_students['Result'] == 'Pass').sum()
    fail_count = (df_students['Result'] == 'Fail').sum()
    print(f"Pass: {pass_count} students")
    print(f"Fail: {fail_count} students")

# For each student, count the subjects in which they scored > 40
above_40_counts = []
for _, row in df_students.iterrows():
    count = sum(1 for subject in subject_columns
                if pd.notna(row[subject]) and row[subject] > 40)
    above_40_counts.append(count)
df_students['Subjects_Above_40'] = above_40_counts

print(f"\nSubjects where students scored > 40:")
print(df_students[['Name', 'Subjects_Above_40']].to_string(index=False))

# Students who scored > 40 in all subjects
all_above_40 = df_students[df_students['Subjects_Above_40'] == len(subject_columns)]
print(f"\nStudents who scored > 40 in ALL {len(subject_columns)} subjects: {len(all_above_40)}")
if len(all_above_40) > 0:
    print("Names:", ", ".join(all_above_40['Name'].tolist()))

# Save the analyzed data
df_students.to_csv('analyzed_student_marks.csv', index=False)
print(f"\nAnalyzed data saved to 'analyzed_student_marks.csv'")
Original columns in the CSV:
['Unnamed: 0', 'Dzongkha', 'English', 'Geography', 'History', 'ICT ', 'Maths', 'Science', 'Unnamed: 8']
First few rows:
Unnamed: 0 Dzongkha English Geography History ICT Maths \
0 Sangay Tenzin 66.63 59.8 57 60.45 47.06 58.3
1 Sujandeep Sunar 72.13 79.35 81.88 77.2 64.75 77.6
2 Singye Dorji 69.32 70.9 58.25 63.6 59.38 60.28
3 Tenzin Wangyal Tshering 70.25 83.95 86.7 81.5 71 79.85
4 Sushmita Kami 73.69 81.85 73.18 82.05 63.31 62.05
Science Unnamed: 8
0 48.35 Pass
1 69.53 Pass
2 55.05 Pass
3 73.95 Pass
4 61.8 Pass
Cleaned column names:
['Unnamed: 0', 'Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science', 'Unnamed: 8']
Subject columns found:
['Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science']
Found 30 student records
Last column appears to be: 'Unnamed: 8'
First few student records:
Name Dzongkha English Geography History ICT \
0 Sangay Tenzin 66.63 59.80 57.00 60.45 47.06
1 Sujandeep Sunar 72.13 79.35 81.88 77.20 64.75
2 Singye Dorji 69.32 70.90 58.25 63.60 59.38
3 Tenzin Wangyal Tshering 70.25 83.95 86.70 81.50 71.00
4 Sushmita Kami 73.69 81.85 73.18 82.05 63.31
Maths Science Result
0 58.30 48.35 Pass
1 77.60 69.53 Pass
2 60.28 55.05 Pass
3 79.85 73.95 Pass
4 62.05 61.80 Pass
Data types:
Name object
Dzongkha float64
English float64
Geography float64
History float64
ICT float64
Maths float64
Science float64
Result object
dtype: object
============================================================
ANALYZING MARKS > 40 BY SUBJECT
============================================================
================================================================================
SUMMARY STATISTICS FOR MARKS > 40
================================================================================
Subject Total Students Students > 40 Percentage > 40 Mean (> 40) Min (> 40) Max (> 40)
Dzongkha 30 30 100.0 71.334667 54.07 82.94
English 30 30 100.0 73.816333 59.80 85.78
Geography 30 30 100.0 71.685000 57.00 88.25
History 30 30 100.0 71.929000 60.45 82.70
ICT 30 30 100.0 60.980667 47.06 75.31
Maths 30 30 100.0 67.632667 53.88 87.28
Science 30 30 100.0 62.162667 48.35 86.50
================================================================================
ADDITIONAL ANALYSIS
================================================================================
Total number of students: 30
Pass/Fail results available in 'Result' column
Pass: 27 students
Fail: 3 students
Subjects where students scored > 40:
Name Subjects_Above_40
Sangay Tenzin 7
Sujandeep Sunar 7
Singye Dorji 7
Tenzin Wangyal Tshering 7
Sushmita Kami 7
Singye Rada 7
Phurpa Wangmo 7
Sonam Eden 7
Nima Yangchen 7
Karma Thinley 7
Khandu Lham 7
Nisha Tamang 7
Kinley Zam 7
Sangay Yeshar Thinley 7
Dorji Dema 7
Pema Shengoen 7
Younten Gyeltshen 7
Tenzin Lhaki Choden 7
Kezang Tshering 7
Kuenga Seldon 7
Sonam Lhazom Tshering 7
Sonam Wangchuk 7
Sujal Sunar 7
Sonam Dendup 7
Sherab Lhamo 7
Tshering Pelden 7
Sonam Wangmo 7
Deki Choden 7
Sonam Tobgay Gyeltshen 7
Sonam Yoezer 7
Students who scored > 40 in ALL 7 subjects: 30
Names: Sangay Tenzin, Sujandeep Sunar, Singye Dorji, Tenzin Wangyal Tshering, Sushmita Kami, Singye Rada, Phurpa Wangmo, Sonam Eden, Nima Yangchen, Karma Thinley, Khandu Lham, Nisha Tamang, Kinley Zam, Sangay Yeshar Thinley, Dorji Dema, Pema Shengoen, Younten Gyeltshen, Tenzin Lhaki Choden, Kezang Tshering, Kuenga Seldon, Sonam Lhazom Tshering, Sonam Wangchuk, Sujal Sunar, Sonam Dendup, Sherab Lhamo, Tshering Pelden, Sonam Wangmo, Deki Choden, Sonam Tobgay Gyeltshen, Sonam Yoezer
Analyzed data saved to 'analyzed_student_marks.csv'
Explanation¶
This Python code analyzes student performance data from a CSV file. It cleans the dataset by removing summary rows, dynamically identifies the subject columns (handling formatting issues such as trailing spaces in headers), and draws histograms of the marks above 40 for each subject. It computes statistics including mean scores, counts and percentages of students above the threshold, builds a summary table, and identifies students who scored above 40 in every subject. The histograms, with red dashed lines marking the mean values, make the per-subject score distributions easy to compare.
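As an aside, the per-student counting loop above can be expressed in one vectorized pandas operation; a minimal sketch, assuming df_students and subject_columns are as constructed above (NaN scores compare as False, matching the loop's pd.notna check):

# Count, per student, how many subject scores exceed 40
df_students['Subjects_Above_40'] = (df_students[subject_columns] > 40).sum(axis=1)

# Per-subject fraction of students above the threshold
share_above_40 = (df_students[subject_columns] > 40).mean(axis=0)
print(share_above_40)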
Probability distribution¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy, gaussian_kde, skew
import warnings
warnings.filterwarnings('ignore')

# Load and clean data
df = pd.read_csv('datasets/viii_2023.csv')
df.columns = [col.strip() for col in df.columns]

# Get student data: the first 30 rows hold student records; later rows
# contain summary information and are skipped
df_students = df.iloc[:30].copy()

# Define subject columns
subjects = ['Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science']

# Convert to numeric
for subject in subjects:
    df_students[subject] = pd.to_numeric(df_students[subject], errors='coerce')

# Create figure for visualization
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Probability Distribution and Entropy Analysis by Subject',
             fontsize=16, fontweight='bold')
axes = axes.flatten()

# Analyze each subject
for idx, subject in enumerate(subjects):
    if idx >= len(axes):
        break
    ax = axes[idx]

    # Get marks data
    marks = df_students[subject].dropna()

    if len(marks) > 0:
        # Histogram normalized to a probability density
        n, bins, patches = ax.hist(marks, bins=15, density=True, alpha=0.6,
                                   color='skyblue', edgecolor='black',
                                   label='Probability Density')

        # Smoothed density via a simple hand-rolled Gaussian-kernel estimate
        smooth_x = np.linspace(marks.min(), marks.max(), 100)
        bandwidth = (marks.max() - marks.min()) / 10
        smooth_y = np.zeros_like(smooth_x)
        for i, x in enumerate(smooth_x):
            kernel = np.exp(-0.5 * ((marks - x) / bandwidth) ** 2)
            smooth_y[i] = np.mean(kernel) / bandwidth
        # Normalize so the curve integrates to 1
        smooth_y = smooth_y / np.trapz(smooth_y, smooth_x)
        ax.plot(smooth_x, smooth_y, 'r-', linewidth=2, label='Smoothed Density')

        # Entropy (in bits) of the discretized distribution
        hist, _ = np.histogram(marks, bins=15, density=True)
        hist = hist / hist.sum()  # Normalize to a probability distribution
        subject_entropy = entropy(hist, base=2)

        # Mark the mean
        ax.axvline(marks.mean(), color='green', linestyle='--', linewidth=2,
                   label=f'Mean: {marks.mean():.1f}')

        # Customize plot
        ax.set_title(f'{subject}\nEntropy: {subject_entropy:.3f} bits',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Probability Density')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8, loc='upper left')
        ax.set_xlim(marks.min() - 5, marks.max() + 5)
    else:
        ax.text(0.5, 0.5, f'No data for {subject}',
                ha='center', va='center', transform=ax.transAxes, fontsize=12)
        ax.set_title(subject, fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Probability Density')

# Remove unused subplots
for idx in range(len(subjects), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()

# Calculate and display entropy summary
print("=" * 70)
print("ENTROPY ANALYSIS SUMMARY (in bits)")
print("=" * 70)

entropy_data = []
for subject in subjects:
    marks = df_students[subject].dropna()
    if len(marks) > 0:
        # Discretize for entropy calculation
        hist, _ = np.histogram(marks, bins=15, density=True)
        hist = hist / hist.sum()  # Normalize
        subject_entropy = entropy(hist, base=2)
        # Maximum possible entropy (uniform distribution over the bins)
        max_entropy = np.log2(len(hist)) if len(hist) > 0 else 0
        entropy_data.append({
            'Subject': subject,
            'Entropy (bits)': subject_entropy,
            'Max Possible': max_entropy,
            'Normalized': subject_entropy / max_entropy if max_entropy > 0 else 0,
            'Mean Score': marks.mean(),
            'Std Dev': marks.std()
        })

# Create summary DataFrame
entropy_df = pd.DataFrame(entropy_data)
print(entropy_df.to_string(index=False))

# Additional analysis: overall distribution across all subjects
print("\n" + "=" * 70)
print("OVERALL MARKS DISTRIBUTION ANALYSIS")
print("=" * 70)

# Combine all marks
all_marks = []
for subject in subjects:
    all_marks.extend(df_students[subject].dropna().tolist())
all_marks = np.array(all_marks)

if len(all_marks) > 0:
    fig2, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Histogram with KDE overlay
    ax1.hist(all_marks, bins=20, density=True, alpha=0.6,
             color='purple', edgecolor='black')
    kde = gaussian_kde(all_marks)
    x_range = np.linspace(all_marks.min(), all_marks.max(), 200)
    ax1.plot(x_range, kde(x_range), 'r-', linewidth=2, label='KDE')
    ax1.set_title('Overall Marks Distribution with KDE', fontsize=12, fontweight='bold')
    ax1.set_xlabel('Marks')
    ax1.set_ylabel('Probability Density')
    ax1.grid(True, alpha=0.3)
    ax1.legend()

    # Entropy by mark range
    bins = np.arange(40, 101, 10)  # Bin edges: 40, 50, 60, 70, 80, 90, 100
    digitized = np.digitize(all_marks, bins)
    # np.digitize returns indices from 1 to len(bins); index 0 would mean
    # a mark below 40, and this dataset has none
    bin_counts = np.bincount(digitized, minlength=len(bins) + 1)
    bin_counts = bin_counts[1:] if len(bin_counts) > len(bins) else bin_counts
    bin_probs = bin_counts / bin_counts.sum() if bin_counts.sum() > 0 else np.zeros_like(bin_counts)

    # Entropy of the binned distribution (non-zero probabilities only)
    valid_probs = bin_probs[bin_probs > 0]
    bin_entropy = entropy(valid_probs, base=2) if len(valid_probs) > 0 else 0

    # Bar chart of probabilities (x_pos matches bin_probs length)
    x_pos = np.arange(len(bin_probs))
    ax2.bar(x_pos, bin_probs, alpha=0.7, color='orange', edgecolor='black')

    # Labels for the bins
    bin_labels = []
    for i in range(len(bins)):
        if i == len(bins) - 1:
            bin_labels.append(f'≥{bins[i]}')
        else:
            bin_labels.append(f'{bins[i]}-{bins[i+1]-1}')
    ax2.set_xticks(x_pos[:len(bin_probs)])
    ax2.set_xticklabels(bin_labels[:len(bin_probs)], rotation=45)
    ax2.set_title(f'Probability by Mark Range\nEntropy: {bin_entropy:.3f} bits',
                  fontsize=12, fontweight='bold')
    ax2.set_xlabel('Mark Range')
    ax2.set_ylabel('Probability')
    ax2.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    plt.show()

    # Print overall statistics
    print(f"Total marks analyzed: {len(all_marks)}")
    print(f"Overall mean: {all_marks.mean():.2f}")
    print(f"Overall std dev: {all_marks.std():.2f}")
    print(f"Overall entropy (binned): {bin_entropy:.3f} bits")
    print(f"Minimum mark: {all_marks.min():.2f}")
    print(f"Maximum mark: {all_marks.max():.2f}")

    # Skewness characterizes the distribution shape
    overall_skew = skew(all_marks)
    print(f"Skewness: {overall_skew:.3f}")
    if overall_skew > 0:
        print("Distribution is positively skewed (right-tailed)")
    elif overall_skew < 0:
        print("Distribution is negatively skewed (left-tailed)")
    else:
        print("Distribution is approximately symmetric")

    # Print bin distribution
    print("\nMark Range Distribution:")
    for i, prob in enumerate(bin_probs):
        if prob > 0:
            if i < len(bins) - 1:
                print(f"  {bins[i]}-{bins[i+1]-1}: {prob:.3f} ({bin_counts[i]} marks)")
            else:
                print(f"  ≥{bins[i]}: {prob:.3f} ({bin_counts[i]} marks)")
======================================================================
ENTROPY ANALYSIS SUMMARY (in bits)
======================================================================
Subject Entropy (bits) Max Possible Normalized Mean Score Std Dev
Dzongkha 3.402910 3.906891 0.871002 71.334667 6.454438
English 3.536243 3.906891 0.905130 73.816333 6.710281
Geography 3.377747 3.906891 0.864561 71.685000 8.364318
History 3.466248 3.906891 0.887214 71.929000 6.299523
ICT 3.523231 3.906891 0.901799 60.980667 7.237737
Maths 3.536243 3.906891 0.905130 67.632667 8.977173
Science 3.419251 3.906891 0.875185 62.162667 9.509660
======================================================================
OVERALL MARKS DISTRIBUTION ANALYSIS
======================================================================
Total marks analyzed: 210
Overall mean: 68.51
Overall std dev: 8.96
Overall entropy (binned): 1.911 bits
Minimum mark: 47.06
Maximum mark: 88.25
Skewness: -0.041
Distribution is negatively skewed (left-tailed)
Mark Range Distribution:
  40-49: 0.014 (3 marks)
  50-59: 0.148 (31 marks)
  60-69: 0.390 (82 marks)
  70-79: 0.333 (70 marks)
  80-89: 0.114 (24 marks)
Explanation¶
This Python code performs an entropy-based probability-distribution analysis of the student marks. It first loads and cleans the CSV file, extracting 30 student records across 7 subjects. For each subject it calculates the Shannon entropy (in bits) of the binned marks to measure the uncertainty/variability of the distribution, with higher entropy indicating marks spread more evenly across the bins. The visualization pairs probability-density histograms with smoothed kernel-density curves showing a continuous estimate of each distribution, plus green dashed lines marking the mean scores.
An overall analysis then combines all marks to show the complete distribution using a KDE and binned probability bars, computing overall entropy and skewness to characterize the distribution's shape (right-tailed, left-tailed, or symmetric). The summary table compares entropy across subjects: in this run Geography and Dzongkha have the lowest entropy, meaning their marks concentrate in fewer bins, while English, Maths, and ICT have the highest, meaning marks spread more evenly across their range. Note that each subject is binned over its own minimum-to-maximum range, so entropy here reflects how evenly marks fill that range rather than the absolute spread, which the Std Dev column captures.
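To make the entropy comparison concrete: Shannon entropy is maximized by a uniform distribution over the bins and drops as probability mass concentrates. A small self-contained check, with the two distributions chosen purely for illustration (15 bins to match the analysis above):

import numpy as np
from scipy.stats import entropy

uniform = np.full(15, 1 / 15)            # mass spread evenly over 15 bins
peaked = np.array([0.86] + [0.01] * 14)  # mass concentrated in one bin

print(f"uniform: {entropy(uniform, base=2):.3f} bits (max = log2(15) = {np.log2(15):.3f})")
print(f"peaked : {entropy(peaked, base=2):.3f} bits")

The uniform case reproduces the 3.907-bit "Max Possible" value in the summary table, while the peaked distribution scores far lower, which is why the Normalized column gauges how evenly each subject's marks fill their range.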