[Kuenzang Dorji] - Fab Futures - Data Science

Week 05: Session on Probability¶

This session covers foundational probability concepts, starting with how to quantify uncertainty through discrete and continuous probability distributions. It then turns to statistical inference: maximizing the log likelihood (which, under a Gaussian noise assumption, reduces to least squares) and Bayesian priors, which act as regularization terms. Key descriptive statistics (expectation, mean, variance, and standard deviation) are defined mathematically, followed by distribution theory: long-tailed and multimodal distributions, the Gaussian (normal) distribution, and the Central Limit Theorem, which explains why sample means tend toward a normal distribution. Practical applications include error reduction through averaging (the error of the mean of N independent samples scales as 1/√N), supported by Python simulations that demonstrate these principles.
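The 1/√N claim can be checked with a short simulation. The sketch below is illustrative only and is not taken from the session material: the function name `error_of_mean`, the unit noise level, and the trial count are all assumptions. It repeatedly averages N Gaussian measurements and compares the observed scatter of those averages with the predicted σ/√N.

```python
import numpy as np

def error_of_mean(n_samples, n_trials=20000, sigma=1.0, seed=0):
    """Empirical standard deviation of the mean of n_samples noisy draws.

    Hypothetical helper for illustration; all parameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Each row is one trial made of n_samples independent Gaussian measurements
    draws = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n_samples))
    # Average within each trial, then measure how much the averages scatter
    return draws.mean(axis=1).std()

for n in (1, 4, 16, 64):
    measured = error_of_mean(n)
    predicted = 1.0 / np.sqrt(n)  # sigma / sqrt(N) with sigma = 1
    print(f"N={n:3d}  measured={measured:.3f}  predicted={predicted:.3f}")
```

Quadrupling N should roughly halve the measured error, which is why averaging gives diminishing returns as N grows.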

Distribution of data using probability¶

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('datasets/viii_2023.csv')

# Let's first inspect the column names
print("Original columns in the CSV:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())

# Clean the column names (remove extra spaces)
df.columns = [col.strip() if isinstance(col, str) else col for col in df.columns]

print("\nCleaned column names:")
print(df.columns.tolist())

# The first column appears to be the student names, let's rename it
df = df.rename(columns={df.columns[0]: 'Name'})

# Subject columns lie between the name column and the final column
subject_columns = df.columns[1:-1].tolist()  # Exclude 'Name' and the last column (Pass/Fail)

print("\nSubject columns found:")
print(subject_columns)

# Remove rows that are completely empty or contain summary information
# Identify student rows (those with actual names); regex=False matches 'S.M='
# literally instead of treating '.' as a wildcard
mask = df['Name'].notna() & ~df['Name'].astype(str).str.contains('S.M=', na=False, regex=False)

# Filter only student rows
df_students = df[mask].copy()

# Reset index
df_students = df_students.reset_index(drop=True)

print(f"\nFound {len(df_students)} student records")

# Convert score columns to numeric
for col in subject_columns:
    if col in df_students.columns:
        df_students[col] = pd.to_numeric(df_students[col], errors='coerce')

# Rename the last column to 'Result' if it's the Pass/Fail column
if df_students.columns[-1] not in subject_columns and df_students.columns[-1] != 'Name':
    pass_fail_col = df_students.columns[-1]
    print(f"\nLast column appears to be: '{pass_fail_col}'")
    # Keep it for reference but don't include in score calculations
    df_students = df_students.rename(columns={pass_fail_col: 'Result'})

print("\nFirst few student records:")
print(df_students.head())

print("\nData types:")
print(df_students.dtypes)

# Now create histograms for marks > 40
print("\n" + "="*60)
print("ANALYZING MARKS > 40 BY SUBJECT")
print("="*60)

# Create a figure with multiple subplots
num_subjects = len(subject_columns)
rows = (num_subjects + 2) // 3  # Calculate rows needed for 3 columns
fig, axes = plt.subplots(rows, 3, figsize=(15, 4*rows))
fig.suptitle('Distribution of Marks > 40 by Subject', fontsize=16, fontweight='bold')

# Flatten axes for easier iteration (plt.subplots returns a 1-D array when rows == 1)
axes = np.atleast_1d(axes).ravel()

# Colors for each subject
colors = plt.cm.tab20(np.linspace(0, 1, num_subjects))

# Analyze each subject
summary_data = []
for idx, subject in enumerate(subject_columns):
    if idx >= len(axes):
        break
        
    ax = axes[idx]
    
    # Get marks > 40
    valid_marks = df_students[subject].dropna()
    marks_above_40 = valid_marks[valid_marks > 40]
    
    if len(marks_above_40) > 0:
        # Calculate statistics
        total_with_marks = len(valid_marks)
        students_above_40 = len(marks_above_40)
        percentage_above_40 = (students_above_40 / total_with_marks) * 100
        mean_above_40 = marks_above_40.mean()
        
        # Create histogram
        min_mark = marks_above_40.min()
        max_mark = marks_above_40.max()
        bins = np.arange(40, max_mark + 5, 5)  # 5-point bins starting from 40
        
        n, bins, patches = ax.hist(marks_above_40, bins=bins, 
                                    edgecolor='black', linewidth=1,
                                    alpha=0.7, color=colors[idx])
        
        # Add mean line
        ax.axvline(mean_above_40, color='red', linestyle='--', 
                   linewidth=2, label=f'Mean: {mean_above_40:.1f}')
        
        # Customize plot
        ax.set_title(f'{subject}\n({students_above_40}/{total_with_marks} > 40)', 
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Number of Students')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=9)
        
        # Add count labels on top of bars
        for i in range(len(n)):
            if n[i] > 0:
                ax.text(bins[i] + 2.5, n[i] + 0.1, f'{int(n[i])}', 
                       ha='center', va='bottom', fontsize=8)
        
        # Store summary
        summary_data.append({
            'Subject': subject,
            'Total Students': total_with_marks,
            'Students > 40': students_above_40,
            'Percentage > 40': percentage_above_40,
            'Mean (> 40)': mean_above_40,
            'Min (> 40)': marks_above_40.min(),
            'Max (> 40)': marks_above_40.max()
        })
    else:
        ax.text(0.5, 0.5, f'No marks > 40\nin {subject}', 
                ha='center', va='center', transform=ax.transAxes, fontsize=12)
        ax.set_title(f'{subject}', fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Number of Students')
        
        summary_data.append({
            'Subject': subject,
            'Total Students': len(valid_marks),
            'Students > 40': 0,
            'Percentage > 40': 0,
            'Mean (> 40)': np.nan,
            'Min (> 40)': np.nan,
            'Max (> 40)': np.nan
        })

# Remove any unused subplots
for idx in range(len(subject_columns), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()

# Create summary DataFrame
summary_df = pd.DataFrame(summary_data)
print("\n" + "="*80)
print("SUMMARY STATISTICS FOR MARKS > 40")
print("="*80)
print(summary_df.to_string(index=False))

# Additional analysis
print("\n" + "="*80)
print("ADDITIONAL ANALYSIS")
print("="*80)

# Calculate overall statistics
total_students = len(df_students)
print(f"Total number of students: {total_students}")

# Report the recorded Pass/Fail results, if that column is present
if 'Result' in df_students.columns:
    print(f"\nPass/Fail results available in 'Result' column")
    pass_count = (df_students['Result'] == 'Pass').sum()
    fail_count = (df_students['Result'] == 'Fail').sum()
    print(f"Pass: {pass_count} students")
    print(f"Fail: {fail_count} students")

# Count, for each student, the subjects with a mark > 40 (NaN compares as False)
df_students['Subjects_Above_40'] = (df_students[subject_columns] > 40).sum(axis=1)

print(f"\nSubjects where students scored > 40:")
print(df_students[['Name', 'Subjects_Above_40']].to_string(index=False))

# Students who scored > 40 in all subjects
all_above_40 = df_students[df_students['Subjects_Above_40'] == len(subject_columns)]
print(f"\nStudents who scored > 40 in ALL {len(subject_columns)} subjects: {len(all_above_40)}")
if len(all_above_40) > 0:
    print("Names:", ", ".join(all_above_40['Name'].tolist()))

# Save the analyzed data
df_students.to_csv('analyzed_student_marks.csv', index=False)
print(f"\nAnalyzed data saved to 'analyzed_student_marks.csv'")
Original columns in the CSV:
['Unnamed: 0', 'Dzongkha', 'English', 'Geography', 'History', 'ICT ', 'Maths', 'Science', 'Unnamed: 8']

First few rows:
                Unnamed: 0 Dzongkha English Geography History   ICT   Maths  \
0            Sangay Tenzin    66.63    59.8        57   60.45  47.06   58.3   
1          Sujandeep Sunar    72.13   79.35     81.88    77.2  64.75   77.6   
2             Singye Dorji    69.32    70.9     58.25    63.6  59.38  60.28   
3  Tenzin Wangyal Tshering    70.25   83.95      86.7    81.5     71  79.85   
4            Sushmita Kami    73.69   81.85     73.18   82.05  63.31  62.05   

  Science Unnamed: 8  
0   48.35       Pass  
1   69.53       Pass  
2   55.05       Pass  
3   73.95       Pass  
4    61.8       Pass  

Cleaned column names:
['Unnamed: 0', 'Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science', 'Unnamed: 8']

Subject columns found:
['Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science']

Found 30 student records

Last column appears to be: 'Unnamed: 8'

First few student records:
                      Name  Dzongkha  English  Geography  History    ICT  \
0            Sangay Tenzin     66.63    59.80      57.00    60.45  47.06   
1          Sujandeep Sunar     72.13    79.35      81.88    77.20  64.75   
2             Singye Dorji     69.32    70.90      58.25    63.60  59.38   
3  Tenzin Wangyal Tshering     70.25    83.95      86.70    81.50  71.00   
4            Sushmita Kami     73.69    81.85      73.18    82.05  63.31   

   Maths  Science Result  
0  58.30    48.35   Pass  
1  77.60    69.53   Pass  
2  60.28    55.05   Pass  
3  79.85    73.95   Pass  
4  62.05    61.80   Pass  

Data types:
Name          object
Dzongkha     float64
English      float64
Geography    float64
History      float64
ICT          float64
Maths        float64
Science      float64
Result        object
dtype: object

============================================================
ANALYZING MARKS > 40 BY SUBJECT
============================================================
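[Figure: "Distribution of Marks > 40 by Subject": one histogram per subject (x-axis: Marks, y-axis: Number of Students) with a red dashed line at the mean]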
================================================================================
SUMMARY STATISTICS FOR MARKS > 40
================================================================================
  Subject  Total Students  Students > 40  Percentage > 40  Mean (> 40)  Min (> 40)  Max (> 40)
 Dzongkha              30             30            100.0    71.334667       54.07       82.94
  English              30             30            100.0    73.816333       59.80       85.78
Geography              30             30            100.0    71.685000       57.00       88.25
  History              30             30            100.0    71.929000       60.45       82.70
      ICT              30             30            100.0    60.980667       47.06       75.31
    Maths              30             30            100.0    67.632667       53.88       87.28
  Science              30             30            100.0    62.162667       48.35       86.50

================================================================================
ADDITIONAL ANALYSIS
================================================================================
Total number of students: 30

Pass/Fail results available in 'Result' column
Pass: 27 students
Fail: 3 students

Subjects where students scored > 40:
                   Name  Subjects_Above_40
          Sangay Tenzin                  7
        Sujandeep Sunar                  7
           Singye Dorji                  7
Tenzin Wangyal Tshering                  7
          Sushmita Kami                  7
            Singye Rada                  7
          Phurpa Wangmo                  7
             Sonam Eden                  7
          Nima Yangchen                  7
          Karma Thinley                  7
            Khandu Lham                  7
           Nisha Tamang                  7
             Kinley Zam                  7
  Sangay Yeshar Thinley                  7
             Dorji Dema                  7
          Pema Shengoen                  7
      Younten Gyeltshen                  7
    Tenzin Lhaki Choden                  7
        Kezang Tshering                  7
          Kuenga Seldon                  7
  Sonam Lhazom Tshering                  7
         Sonam Wangchuk                  7
            Sujal Sunar                  7
           Sonam Dendup                  7
           Sherab Lhamo                  7
        Tshering Pelden                  7
           Sonam Wangmo                  7
            Deki Choden                  7
 Sonam Tobgay Gyeltshen                  7
           Sonam Yoezer                  7

Students who scored > 40 in ALL 7 subjects: 30
Names: Sangay Tenzin, Sujandeep Sunar, Singye Dorji, Tenzin Wangyal Tshering, Sushmita Kami, Singye Rada, Phurpa Wangmo, Sonam Eden, Nima Yangchen, Karma Thinley, Khandu Lham, Nisha Tamang, Kinley Zam, Sangay Yeshar Thinley, Dorji Dema, Pema Shengoen, Younten Gyeltshen, Tenzin Lhaki Choden, Kezang Tshering, Kuenga Seldon, Sonam Lhazom Tshering, Sonam Wangchuk, Sujal Sunar, Sonam Dendup, Sherab Lhamo, Tshering Pelden, Sonam Wangmo, Deki Choden, Sonam Tobgay Gyeltshen, Sonam Yoezer

Analyzed data saved to 'analyzed_student_marks.csv'

Explanation¶

This code analyzes student marks from a CSV file. It strips stray spaces from the column names (the raw header contains 'ICT '), renames the unnamed first column to Name, drops summary rows, converts the seven subject columns to numeric values, and draws one histogram per subject for marks above 40, with a red dashed line at the mean. It also builds a summary table (count and percentage of students above the threshold, plus the mean, minimum, and maximum) and counts, for each student, the subjects in which they scored above 40.

For this dataset the threshold removes nothing: all 30 students scored above 40 in all 7 subjects, so the histograms show the full distributions. Subject means range from 61.0 (ICT) to 73.8 (English). The Result column still records 27 passes and 3 fails, so passing must depend on a criterion other than scoring above 40 in every subject.
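The cleaning steps can be tried in isolation on a few rows. The following is a minimal sketch, not the notebook's actual data: the in-memory CSV, the student names, and the marks are invented for illustration, and the `S.M=` summary row is an assumed stand-in for the summary rows the notebook filters out.

```python
import io
import pandas as pd

# Hypothetical miniature of the marks file: an unnamed name column, a header
# with a trailing space ('ICT '), and a summary row at the bottom.
raw = io.StringIO(
    ",Dzongkha,ICT ,Result\n"
    "Student A,66.5,47.0,Pass\n"
    "Student B,38.0,64.8,Fail\n"
    "S.M=52.25,,,\n"
)
df = pd.read_csv(raw)

# Strip padding from the header and give the first column a usable name
df.columns = [c.strip() for c in df.columns]
df = df.rename(columns={df.columns[0]: "Name"})

# Keep only student rows; regex=False matches the marker text literally
is_student = df["Name"].notna() & ~df["Name"].astype(str).str.contains("S.M=", regex=False)
students = df[is_student].reset_index(drop=True)

# Coerce marks to numbers, then count subjects above the threshold per student
subjects = ["Dzongkha", "ICT"]
students[subjects] = students[subjects].apply(pd.to_numeric, errors="coerce")
students["Subjects_Above_40"] = (students[subjects] > 40).sum(axis=1)
print(students)
```

Comparing a whole block of columns against the threshold and summing along rows avoids an explicit loop over students; missing marks compare as False and so are not counted.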

Probability distribution¶

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy, gaussian_kde, skew
import warnings
warnings.filterwarnings('ignore')

# Load and clean data
df = pd.read_csv('datasets/viii_2023.csv')
df.columns = [col.strip() for col in df.columns]

# Get student data (skip summary rows)
df_students = df.iloc[:30].copy()  # First 30 rows contain student data

# Define subject columns
subjects = ['Dzongkha', 'English', 'Geography', 'History', 'ICT', 'Maths', 'Science']

# Convert to numeric
for subject in subjects:
    df_students[subject] = pd.to_numeric(df_students[subject], errors='coerce')

# Create figure for visualization
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Probability Distribution and Entropy Analysis by Subject', fontsize=16, fontweight='bold')
axes = axes.flatten()

# Analyze each subject
for idx, subject in enumerate(subjects):
    if idx >= len(axes):
        break
    
    ax = axes[idx]
    
    # Get marks data
    marks = df_students[subject].dropna()
    
    if len(marks) > 0:
        # Create histogram with probability density
        n, bins, patches = ax.hist(marks, bins=15, density=True, alpha=0.6, 
                                   color='skyblue', edgecolor='black', label='Probability Density')
        
        # Calculate bin centers for line plot
        bin_centers = (bins[:-1] + bins[1:]) / 2
        
        # Create smoothed line (hand-rolled Gaussian kernel density estimate)
        if len(bin_centers) > 1:
            smooth_x = np.linspace(marks.min(), marks.max(), 100)
            
            # Simple kernel density estimation
            bandwidth = (marks.max() - marks.min()) / 10
            smooth_y = np.zeros_like(smooth_x)
            
            for i, x in enumerate(smooth_x):
                # Gaussian kernel
                kernel = np.exp(-0.5 * ((marks - x) / bandwidth) ** 2)
                smooth_y[i] = np.mean(kernel) / bandwidth
            
            # Normalize
            smooth_y = smooth_y / np.trapz(smooth_y, smooth_x)
            
            # Plot smoothed probability density
            ax.plot(smooth_x, smooth_y, 'r-', linewidth=2, label='Smoothed Density')
        
        # Calculate entropy of the distribution
        # Discretize for entropy calculation
        hist, _ = np.histogram(marks, bins=15, density=True)
        hist = hist / hist.sum()  # Normalize to probability distribution
        
        # Calculate entropy (bits)
        subject_entropy = entropy(hist, base=2)
        
        # Add statistics
        ax.axvline(marks.mean(), color='green', linestyle='--', linewidth=2, 
                  label=f'Mean: {marks.mean():.1f}')
        
        # Customize plot
        ax.set_title(f'{subject}\nEntropy: {subject_entropy:.3f} bits', fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Probability Density')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8, loc='upper left')
        
        # Set appropriate x-axis limits
        ax.set_xlim(marks.min() - 5, marks.max() + 5)
    else:
        ax.text(0.5, 0.5, f'No data for {subject}', 
                ha='center', va='center', transform=ax.transAxes, fontsize=12)
        ax.set_title(subject, fontsize=11, fontweight='bold')
        ax.set_xlabel('Marks')
        ax.set_ylabel('Probability Density')

# Remove unused subplots
for idx in range(len(subjects), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()

# Calculate and display entropy summary
print("="*70)
print("ENTROPY ANALYSIS SUMMARY (in bits)")
print("="*70)

entropy_data = []
for subject in subjects:
    marks = df_students[subject].dropna()
    
    if len(marks) > 0:
        # Discretize for entropy calculation
        hist, _ = np.histogram(marks, bins=15, density=True)
        hist = hist / hist.sum()  # Normalize
        
        # Calculate entropy
        subject_entropy = entropy(hist, base=2)
        
        # Maximum possible entropy (uniform distribution)
        max_entropy = np.log2(len(hist)) if len(hist) > 0 else 0
        
        entropy_data.append({
            'Subject': subject,
            'Entropy (bits)': subject_entropy,
            'Max Possible': max_entropy,
            'Normalized': subject_entropy / max_entropy if max_entropy > 0 else 0,
            'Mean Score': marks.mean(),
            'Std Dev': marks.std()
        })

# Create summary DataFrame
entropy_df = pd.DataFrame(entropy_data)
print(entropy_df.to_string(index=False))

# Additional analysis: Overall distribution
print("\n" + "="*70)
print("OVERALL MARKS DISTRIBUTION ANALYSIS")
print("="*70)

# Combine all marks
all_marks = []
for subject in subjects:
    marks = df_students[subject].dropna()
    all_marks.extend(marks.tolist())

all_marks = np.array(all_marks)

if len(all_marks) > 0:
    # Create overall histogram
    fig2, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Histogram with KDE
    ax1.hist(all_marks, bins=20, density=True, alpha=0.6, color='purple', edgecolor='black')
    
    # Add KDE
    kde = gaussian_kde(all_marks)
    x_range = np.linspace(all_marks.min(), all_marks.max(), 200)
    ax1.plot(x_range, kde(x_range), 'r-', linewidth=2, label='KDE')
    
    ax1.set_title('Overall Marks Distribution with KDE', fontsize=12, fontweight='bold')
    ax1.set_xlabel('Marks')
    ax1.set_ylabel('Probability Density')
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    
    # Probability and entropy by mark range
    bins = np.arange(40, 101, 10)  # Bin edges: 40, 50, ..., 100
    digitized = np.digitize(all_marks, bins)
    
    # np.digitize returns 0 for marks below 40 and len(bins) for marks >= 100,
    # so count len(bins) + 1 slots and drop slot 0 (no mark is below 40 here)
    bin_counts = np.bincount(digitized, minlength=len(bins)+1)
    bin_counts = bin_counts[1:]
    
    # Calculate probabilities
    bin_probs = bin_counts / bin_counts.sum() if bin_counts.sum() > 0 else np.zeros_like(bin_counts)
    
    # Calculate entropy for binned data
    valid_probs = bin_probs[bin_probs > 0]  # Only use non-zero probabilities
    bin_entropy = entropy(valid_probs, base=2) if len(valid_probs) > 0 else 0
    
    # Bar chart of probabilities, one bar per mark range
    x_pos = np.arange(len(bin_probs))
    ax2.bar(x_pos, bin_probs, alpha=0.7, color='orange', edgecolor='black')
    
    # Create labels for the bins
    bin_labels = []
    for i in range(len(bins)):
        if i == len(bins) - 1:
            bin_labels.append(f'≥{bins[i]}')
        else:
            bin_labels.append(f'{bins[i]}-{bins[i+1]-1}')
    
    # Only show labels for bins that have data
    ax2.set_xticks(x_pos[:len(bin_probs)])
    ax2.set_xticklabels(bin_labels[:len(bin_probs)], rotation=45)
    
    ax2.set_title(f'Probability by Mark Range\nEntropy: {bin_entropy:.3f} bits', 
                  fontsize=12, fontweight='bold')
    ax2.set_xlabel('Mark Range')
    ax2.set_ylabel('Probability')
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    # Print overall statistics
    print(f"Total marks analyzed: {len(all_marks)}")
    print(f"Overall mean: {all_marks.mean():.2f}")
    print(f"Overall std dev: {all_marks.std():.2f}")
    print(f"Overall entropy (binned): {bin_entropy:.3f} bits")
    print(f"Minimum mark: {all_marks.min():.2f}")
    print(f"Maximum mark: {all_marks.max():.2f}")
    
    # Calculate skewness (distribution shape)
    print(f"Skewness: {skew(all_marks):.3f}")
    if skew(all_marks) > 0:
        print("Distribution is positively skewed (right-tailed)")
    elif skew(all_marks) < 0:
        print("Distribution is negatively skewed (left-tailed)")
    else:
        print("Distribution is approximately symmetric")
    
    # Print bin distribution
    print("\nMark Range Distribution:")
    for i, prob in enumerate(bin_probs):
        if prob > 0:
            if i < len(bins) - 1:
                print(f"  {bins[i]}-{bins[i+1]-1}: {prob:.3f} ({bin_counts[i]} marks)")
            else:
                print(f"  ≥{bins[i]}: {prob:.3f} ({bin_counts[i]} marks)")
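[Figure: "Probability Distribution and Entropy Analysis by Subject": per-subject density histograms (x-axis: Marks, y-axis: Probability Density) with a red smoothed density curve and a green dashed line at the mean]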
======================================================================
ENTROPY ANALYSIS SUMMARY (in bits)
======================================================================
  Subject  Entropy (bits)  Max Possible  Normalized  Mean Score  Std Dev
 Dzongkha        3.402910      3.906891    0.871002   71.334667 6.454438
  English        3.536243      3.906891    0.905130   73.816333 6.710281
Geography        3.377747      3.906891    0.864561   71.685000 8.364318
  History        3.466248      3.906891    0.887214   71.929000 6.299523
      ICT        3.523231      3.906891    0.901799   60.980667 7.237737
    Maths        3.536243      3.906891    0.905130   67.632667 8.977173
  Science        3.419251      3.906891    0.875185   62.162667 9.509660

======================================================================
OVERALL MARKS DISTRIBUTION ANALYSIS
======================================================================
[Figure: "Overall Marks Distribution with KDE" (left) and "Probability by Mark Range" bar chart with binned entropy (right)]
Total marks analyzed: 210
Overall mean: 68.51
Overall std dev: 8.96
Overall entropy (binned): 1.911 bits
Minimum mark: 47.06
Maximum mark: 88.25
Skewness: -0.041
Distribution is negatively skewed (left-tailed)

Mark Range Distribution:
  40-49: 0.014 (3 marks)
  50-59: 0.148 (31 marks)
  60-69: 0.390 (82 marks)
  70-79: 0.333 (70 marks)
  80-89: 0.114 (24 marks)

Explanation¶

This code analyzes the probability distribution of the marks and summarizes each distribution with Shannon entropy. It loads and cleans the CSV file and keeps the 30 student records across 7 subjects. For each subject it sorts the marks into 15 equal-width bins spanning that subject's own minimum and maximum, converts the counts to probabilities, and computes the entropy in bits; the maximum possible value is log2(15) ≈ 3.907 bits, reached when every bin is equally occupied. Because the bins are rescaled to each subject's range, this entropy measures how evenly the marks fill that range rather than how wide the range is; the standard deviation column captures the spread. Each plot shows a density-normalized histogram, a red smoothed curve from a Gaussian kernel density estimate (KDE), and a green dashed line at the mean.
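To see what a histogram entropy of this kind does and does not measure, consider the sketch below. It is an illustration under assumptions: the helper name `histogram_entropy` and the synthetic marks are not from the notebook. The same set of scores is expressed once out of 25 and once out of 100, so the spread differs by a factor of four while the shape is identical.

```python
import numpy as np
from scipy.stats import entropy

def histogram_entropy(values, bins=15):
    """Shannon entropy (bits) of equal-width bins spanning min..max of values."""
    counts, _ = np.histogram(values, bins=bins)
    probs = counts / counts.sum()
    # scipy treats empty bins as contributing zero to the sum
    return entropy(probs, base=2)

rng = np.random.default_rng(0)
out_of_25 = rng.normal(17.5, 1.5, size=30)  # synthetic marks, not real data
out_of_100 = out_of_25 * 4                  # same marks on a larger scale

h_small = histogram_entropy(out_of_25)
h_large = histogram_entropy(out_of_100)

print(f"std dev: {out_of_25.std():.2f} vs {out_of_100.std():.2f}")
print(f"entropy: {h_small:.3f} vs {h_large:.3f} bits")
print(f"upper bound log2(15): {np.log2(15):.3f} bits")
```

The standard deviation changes with the scale, but the entropy does not, because the bin edges stretch along with the data. A measure of spread therefore has to come from the standard deviation or from bins with fixed edges.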

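An overall analysis pools all 210 marks, plots their distribution with a KDE, and shows the probability of each 10-mark range as a bar chart. The binned entropy is 1.911 bits, and the skewness of -0.041 is close enough to zero that the pooled distribution is effectively symmetric, even though the script labels any negative value as left-tailed. In the per-subject summary table the entropies are all similar (3.38 to 3.54 bits, or 86% to 91% of the maximum): Geography is lowest, and English and Maths are highest. With only 30 marks spread over 15 bins these differences are small, and they do not track the spread of scores; Science has the largest standard deviation (9.51) but one of the lower entropies. The standard deviation is therefore the better guide to which subjects have consistent versus variable outcomes.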
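A skewness value is easier to interpret next to a rough measure of its sampling noise. The sketch below is one possible approach rather than the notebook's method: it uses the large-sample approximation √(6/n) for the standard error of skewness under normality, and the helper name `describe_skew`, the cutoff of two standard errors, and the synthetic samples are all assumptions. For n = 210 the approximation gives about 0.169, so the printed skewness of -0.041 lies well inside the band that would be called symmetric.

```python
import numpy as np
from scipy.stats import skew

def describe_skew(values, z=2.0):
    """Label skewness, treating |skew| < z * SE as approximately symmetric."""
    values = np.asarray(values, dtype=float)
    g1 = skew(values)
    # Large-sample approximation to the standard error under normality
    se = np.sqrt(6.0 / len(values))
    if abs(g1) < z * se:
        label = "approximately symmetric"
    elif g1 > 0:
        label = "positively skewed (right-tailed)"
    else:
        label = "negatively skewed (left-tailed)"
    return g1, se, label

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(68.5, 9.0, size=210),        # synthetic, symmetric
    "exponential": rng.exponential(10.0, size=210),   # synthetic, right-tailed
}
for name, data in samples.items():
    g1, se, label = describe_skew(data)
    print(f"{name}: skew={g1:.3f}, SE~{se:.3f} -> {label}")
```

The cutoff is a convention, not a formal test; a bootstrap would give a standard error that does not rely on the normality assumption.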
