Week 5: Assignment ~ Exploring Probability Distributions
In this assignment, I explored the probability distribution of common disease cases in Bhutan (2023), focusing on how cases are spread across age groups and sex.
This is descriptive statistics and distribution analysis: there are no predictions yet, only an attempt to understand the shape, spread, and patterns in the data. The goal is to answer questions like:
- Which diseases dominate case counts?
- Are case counts skewed or symmetric?
- Do certain age groups or sexes show higher variability?
- Are variables (e.g., age and disease count) linearly related (via covariance), or is there a more complex dependency (via mutual information)?
For example, Diarrhoea shows a long-tailed distribution: most age/sex groups have low counts, but the 1–4 years group has a sharp peak (~3,000 cases), revealing a clear public health pattern.
Step 1: Loading and Cleaning the Data
In [9]:
import pandas as pd
import numpy as np
# Load the file
df = pd.read_csv("datasets/DataSet_CommonDiseases.csv", header=1) # header=1 skips first row (title)
df = df.dropna(how='all') # remove empty rows
df = df.fillna(0) # replace blanks with 0
df.head(3) # show first 3 rows
diarrhoea_row = df[df.iloc[:, 0] == 'Diarrhoea']
# Extract just the numbers (skip disease name)
counts = diarrhoea_row.iloc[:, 1:].values.flatten().astype(int)
# Age groups & sex labels
age_groups = ['0-29 Days', '1-11 Months', '1-4 Years', '5-9 Years',
              '10-14 Years', '15-19 Years', '20-24 Years',
              '25-49 Years', '50-59 Years', '60+ Years']
# Assumption: the CSV columns run M, F within each age group, so each age label
# is repeated twice in a row. (age_groups * 2 would pair every age group with
# only one sex and move the ~3,000-case peak away from 1-4 Years.)
ages = np.repeat(age_groups, 2)
sexes = ['M', 'F'] * len(age_groups)
# Make tidy table: one row per (age group, sex) pair
simple_df = pd.DataFrame({
    'Age': ages,
    'Sex': sexes,
    'Count': counts
})
simple_df
Out[9]:
| | Age | Sex | Count |
|---|---|---|---|
| 0 | 0-29 Days | M | 68 |
| 1 | 0-29 Days | F | 66 |
| 2 | 1-11 Months | M | 928 |
| 3 | 1-11 Months | F | 906 |
| 4 | 1-4 Years | M | 3303 |
| 5 | 1-4 Years | F | 2800 |
| 6 | 5-9 Years | M | 2007 |
| 7 | 5-9 Years | F | 1753 |
| 8 | 10-14 Years | M | 1717 |
| 9 | 10-14 Years | F | 1378 |
| 10 | 15-19 Years | M | 1230 |
| 11 | 15-19 Years | F | 997 |
| 12 | 20-24 Years | M | 991 |
| 13 | 20-24 Years | F | 920 |
| 14 | 25-49 Years | M | 2722 |
| 15 | 25-49 Years | F | 2928 |
| 16 | 50-59 Years | M | 801 |
| 17 | 50-59 Years | F | 865 |
| 18 | 60+ Years | M | 1318 |
| 19 | 60+ Years | F | 1387 |
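A quick sanity check is worth running on any tidy table assembled by hand like this: every age group should appear exactly once for each sex. The sketch below rebuilds the table from the twenty Diarrhoea counts shown in the output and tests that property. It assumes the source columns alternate M, F within each age group, which is a guess about the CSV layout rather than something the file confirms, and `check_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Diarrhoea counts copied from the output table, in column order
counts = [68, 66, 928, 906, 3303, 2800, 2007, 1753, 1717, 1378,
          1230, 997, 991, 920, 2722, 2928, 801, 865, 1318, 1387]
age_groups = ['0-29 Days', '1-11 Months', '1-4 Years', '5-9 Years',
              '10-14 Years', '15-19 Years', '20-24 Years',
              '25-49 Years', '50-59 Years', '60+ Years']

# Assumption: columns alternate M, F within each age group
check_df = pd.DataFrame({
    'Age': np.repeat(age_groups, 2),
    'Sex': ['M', 'F'] * len(age_groups),
    'Count': counts,
})

# Each age group should carry both sexes, and no (Age, Sex) pair should repeat
sexes_per_age = check_df.groupby('Age')['Sex'].nunique()
has_both_sexes = bool((sexes_per_age == 2).all())
no_duplicates = not check_df.duplicated(['Age', 'Sex']).any()

print("Every age group has both sexes:", has_both_sexes)
print("No repeated (Age, Sex) pairs:", no_duplicates)
```

If either check prints `False`, the age and sex labels are not lined up with the counts the way the table assumes.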
Step 2: Making a histogram
In [10]:
import matplotlib.pyplot as plt
plt.figure(figsize=(6,4))
plt.hist(simple_df['Count'], bins=10, color='skyblue', edgecolor='black')
plt.title("How common are different case counts for Diarrhoea?")
plt.xlabel("Number of cases")
plt.ylabel("How many age/sex groups have that count")
plt.show()
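To put a number on the shape the histogram shows, one option is the sample skewness, which is positive when the right tail is the longer one. The sketch below is a minimal illustration on the twenty Diarrhoea counts from Step 1, using `np.histogram` (the binning routine behind `plt.hist`) and pandas' `Series.skew()`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Diarrhoea counts from Step 1
counts = pd.Series([68, 66, 928, 906, 3303, 2800, 2007, 1753, 1717, 1378,
                    1230, 997, 991, 920, 2722, 2928, 801, 865, 1318, 1387])

# Number of age/sex groups that land in each of the 10 bins
bin_counts, bin_edges = np.histogram(counts, bins=10)

# Bias-corrected sample skewness: > 0 means a longer right tail
skewness = counts.skew()

print("Groups per bin:", bin_counts.tolist())
print(f"Mean: {counts.mean():.1f}, median: {counts.median():.1f}")
print(f"Skewness: {skewness:.2f}")
```

A mean above the median together with a positive skewness is consistent with a right-tailed distribution, though with only twenty values the estimate is rough.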
Step 3: Computing Basic Stats (Mean, Std Dev)
In [11]:
mean_val = simple_df['Count'].mean()
std_val = simple_df['Count'].std()
print(f"Average cases per age/sex group: {mean_val:.1f}")
print(f"Standard deviation (how 'spread out' it is): {std_val:.1f}")
print(f"So typical range: {mean_val - std_val:.0f} to {mean_val + std_val:.0f}")
Average cases per age/sex group: 1454.2
Standard deviation (how 'spread out' it is): 902.1
So typical range: 552 to 2356
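The "typical range" here is the mean plus or minus one standard deviation. Two details are worth checking: pandas' `.std()` uses the sample formula (`ddof=1`), while NumPy's `np.std` defaults to the population formula (`ddof=0`), and the number of groups that actually fall inside the range can be counted directly. A sketch on the same twenty counts, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# Diarrhoea counts from Step 1
counts = pd.Series([68, 66, 928, 906, 3303, 2800, 2007, 1753, 1717, 1378,
                    1230, 997, 991, 920, 2722, 2928, 801, 865, 1318, 1387])

mean_val = counts.mean()
sample_std = counts.std()                    # pandas default: ddof=1
population_std = np.std(counts.to_numpy())   # NumPy default: ddof=0

# Count the groups inside mean +/- one sample standard deviation
low, high = mean_val - sample_std, mean_val + sample_std
inside = int(counts.between(low, high).sum())

print(f"Sample std: {sample_std:.1f}, population std: {population_std:.1f}")
print(f"Groups inside {low:.0f} to {high:.0f}: {inside} of {len(counts)}")
```

On these counts, 14 of the 20 groups fall inside the range, so "typical" describes most groups but not the very low newborn counts or the highest peaks.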
Step 4: Compare Male vs Female
In [12]:
male_avg = simple_df[simple_df['Sex'] == 'M']['Count'].mean()
female_avg = simple_df[simple_df['Sex'] == 'F']['Count'].mean()
print(f"Male avg: {male_avg:.1f}")
print(f"Female avg: {female_avg:.1f}")
print(f"Difference: {male_avg - female_avg:+.1f} (positive = more in males)")
Male avg: 1508.5
Female avg: 1400.0
Difference: +108.5 (positive = more in males)
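A difference in means is easier to judge next to the spread within each sex. The sketch below uses `groupby(...).agg(...)` to put the mean, median, and standard deviation side by side. It reuses the twenty counts from Step 1 with the same alternating M/F labels; `sex_df` and `summary` are illustrative names.

```python
import pandas as pd

# Diarrhoea counts from Step 1, labelled M, F, M, F, ...
counts = [68, 66, 928, 906, 3303, 2800, 2007, 1753, 1717, 1378,
          1230, 997, 991, 920, 2722, 2928, 801, 865, 1318, 1387]
sex_df = pd.DataFrame({'Sex': ['M', 'F'] * 10, 'Count': counts})

# One row per sex, one column per statistic
summary = sex_df.groupby('Sex')['Count'].agg(['mean', 'median', 'std'])
diff = summary.loc['M', 'mean'] - summary.loc['F', 'mean']

print(summary.round(1))
print(f"Difference in means (M - F): {diff:+.1f}")
```

When the standard deviations are several times larger than the difference in means, the gap between sexes is small relative to the variation across age groups, so it should be read with caution.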
Step 5: Displaying the data age-wise
In [13]:
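# Group by age (sum M+F); sort=False keeps the age groups in their original
# order instead of sorting the labels alphabetically
age_totals = simple_df.groupby('Age', sort=False)['Count'].sum()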
plt.figure(figsize=(8,4))
age_totals.plot(kind='bar', color='teal')
plt.title("Diarrhoea: Total cases by age group")
plt.ylabel("Total cases (M + F)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
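The numbers behind the bar chart can also be listed directly, which makes the peak easy to read off. This sketch assumes the counts alternate M, F within each age group (an assumption about the CSV layout) and passes `sort=False` to `groupby` so the age groups stay in their natural order rather than alphabetical order.

```python
import numpy as np
import pandas as pd

# Diarrhoea counts from Step 1, in column order
counts = [68, 66, 928, 906, 3303, 2800, 2007, 1753, 1717, 1378,
          1230, 997, 991, 920, 2722, 2928, 801, 865, 1318, 1387]
age_groups = ['0-29 Days', '1-11 Months', '1-4 Years', '5-9 Years',
              '10-14 Years', '15-19 Years', '20-24 Years',
              '25-49 Years', '50-59 Years', '60+ Years']

# Assumption: two consecutive counts (M, then F) belong to each age group
age_df = pd.DataFrame({'Age': np.repeat(age_groups, 2), 'Count': counts})

# Total per age group and its share of all Diarrhoea cases
age_totals = age_df.groupby('Age', sort=False)['Count'].sum()
share = (age_totals / age_totals.sum() * 100).round(1)
peak_age = age_totals.idxmax()

print(pd.DataFrame({'Total': age_totals, 'Share (%)': share}))
print("Age group with the most cases:", peak_age)
```

Under that layout assumption, the largest total belongs to the 1-4 Years group, which matches the peak described in the introduction.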