Bijay Rai – Fab Futures – Data Science

Week 2: Assignment ~ Data Visualization

In this assignment, I used four kinds of visualization, ranging from basic to more advanced, to explore disease morbidity patterns in Bhutan’s 2023 outpatient data. The visualizations are:

• a stacked bar graph comparing the case counts of the most common diseases across age groups,
• a Sankey diagram tracing how cases of the top diseases are distributed across age groups and then by gender,
• a 3D scatter plot relating disease, age group, and case count (on a log scale), and
• a scatter plot comparing male and female case counts by disease and age group.

Together, these views balance accessibility, analytical depth, and exploratory discovery, while aiming to follow good practice in visual encoding.

Reference

To develop an effective and interpretable visualization of age-disaggregated disease morbidity data from Bhutan’s official health statistics, the following prompt was submitted to the Qwen large language model (Tongyi Lab, 2024):

Prompt to Qwen (2025-11-25): Using the Common Diseases dataset from Bhutan’s Annual Health Statistics 2023 (MoH, 2024), create a reliable, insightful, publication-ready Python (pandas/matplotlib) visualization of outpatient morbidity by ICD-10 disease and age group—using only clearly labeled age columns (e.g., '0-29 Days', '1-11 Months', …, '60+ Years'). Exclude ambiguous/unlabeled columns. Include a brief academic caption and proper citation.
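
The prompt asks for only clearly labeled age columns. In a two-row header where the age label sits above the 'M' column and the cell above 'F' is blank, one way to label every column is to forward-fill the age row before combining it with the gender row. The sketch below is a minimal illustration under that assumption; the miniature CSV, its values, and the helper name `load_two_row_header` are all made up for the example and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical miniature of the two-row header layout (all values are made up).
SAMPLE = """Disease,0-29 Days,,60+ Years,
,M,F,M,F
Common Cold,3,4,10,12
Diarrhoea,1,2,5,6
"""

def load_two_row_header(buffer):
    """Return a frame with columns like '0-29 Days_M', forward-filling blank age cells."""
    raw = pd.read_csv(buffer, header=None, dtype=str)
    # The blank cell above each 'F' inherits the age label to its left.
    ages = raw.iloc[0].ffill().tolist()
    sexes = raw.iloc[1].tolist()
    names = ["Disease"] + [f"{a.strip()}_{s.strip()}" for a, s in zip(ages[1:], sexes[1:])]
    data = raw.iloc[2:].reset_index(drop=True)
    data.columns = names
    value_cols = names[1:]
    # Non-numeric cells become NaN and are then counted as zero cases.
    data[value_cols] = data[value_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
    return data

df_demo = load_two_row_header(io.StringIO(SAMPLE))
print(df_demo.columns.tolist())
```

With this layout every value column carries both an age group and a gender, so no 'Unnamed' columns need to be excluded afterwards.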

Bar Graph

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load with 2 header rows
df_raw = pd.read_csv("datasets/DataSet_CommonDiseases.csv", header=[0, 1])

# Flatten multi-index columns: combine (AgeGroup, Gender) → 'AgeGroup_Gender'
# E.g., ('0-29 Days', 'M') → '0-29d_M'
new_cols = []
for col in df_raw.columns:
    if isinstance(col, tuple):
        age_group, gender = col
        if pd.isna(age_group) or pd.isna(gender):
            new_cols.append('Disease')  # first column
        else:
            # Clean age group: '0-29 Days' → '0-29d'
            ag = age_group.strip().replace(' Days', 'd').replace(' Months', 'm').replace(' Years', 'y')
            new_cols.append(f"{ag}_{gender.strip()}")
    else:
        new_cols.append(col)

df_raw.columns = new_cols

# Rename first column and drop empty rows
df_raw = df_raw.rename(columns={df_raw.columns[0]: 'Disease'})
df = df_raw.copy()

# Keep only rows with valid disease names (non-null, not empty, not summary rows)
df = df[df['Disease'].notna()]
df = df[df['Disease'].str.strip() != '']
df = df[~df['Disease'].str.contains("TOTAL|PRIORITY|Complications", case=False, na=False)]

# Extract numeric columns: pattern like '0-29d_M', '60+y_F'
age_gender_cols = [col for col in df.columns if '_' in col and col != 'Disease']
df[age_gender_cols] = df[age_gender_cols].apply(pd.to_numeric, errors='coerce').fillna(0)

# Create age-group totals (e.g., sum M+F for '0-29d')
age_groups = sorted(set(col.split('_')[0] for col in age_gender_cols))
print("✅ Detected age groups:", age_groups)

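# Aggregate M+F per age group.
# With header=[0, 1], the blank header cell above each 'F' column is read as
# 'Unnamed: N_level_0', so f"{ag}_F" never matches and the 'Unnamed: N' entries
# detected above are not real age groups. This assumes each named '<age>_M'
# column is immediately followed by its 'F' column in the file.
age_groups = [c[:-2] for c in age_gender_cols
              if c.endswith('_M') and not c.startswith('Unnamed')]  # file order
for ag in age_groups:
    m_idx = df.columns.get_loc(f"{ag}_M")
    df[ag] = df.iloc[:, [m_idx, m_idx + 1]].sum(axis=1)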

# Compute total cases (only from age groups, no duplicates)
df['Total'] = df[age_groups].sum(axis=1)

# Filter and select top diseases
df = df[df['Total'] > 0].copy()
top_n = 15
top_df = df.nlargest(top_n, 'Total').sort_values('Total', ascending=True)

# --- Plot ---
fig, ax = plt.subplots(figsize=(14, 9))

y_pos = np.arange(len(top_df))
left = np.zeros(len(top_df))

# Use viridis for clear ordering
colors = plt.cm.viridis(np.linspace(0.2, 0.9, len(age_groups)))

for i, ag in enumerate(age_groups):
    widths = top_df[ag].values
    ax.barh(y_pos, widths, left=left, label=ag, color=colors[i], edgecolor='white', linewidth=0.4)
    left += widths

# Labels
ax.set_yticks(y_pos)
ax.set_yticklabels([name[:35] + '...' if len(name) > 35 else name 
                    for name in top_df['Disease']], fontsize=10)
ax.set_xlabel('Number of Cases (Male + Female)', fontsize=12)
ax.set_title(f'Top {top_n} Diseases by Age Group (M+F Combined)\nStacked Horizontal Bar Plot', 
             fontsize=14, fontweight='bold', pad=15)

# Legend & grid
ax.legend(title='Age Group', bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=9)
ax.grid(axis='x', linestyle='--', alpha=0.6)
ax.set_axisbelow(True)

# Annotate totals
for i, total in enumerate(top_df['Total']):
    ax.text(left.max() * 1.01, y_pos[i], f"{int(total):,}", 
            va='center', ha='left', fontsize=9, fontweight='bold', color='gray')

plt.tight_layout()
plt.show()
✅ Detected age groups: ['0-29d', '1-11m', '1-4y', '10-14y', '15-19y', '20-24y', '25-49y', '5-9y', '50-59y', '60+y', 'Unnamed: 10', 'Unnamed: 12', 'Unnamed: 14', 'Unnamed: 16', 'Unnamed: 18', 'Unnamed: 2', 'Unnamed: 20', 'Unnamed: 4', 'Unnamed: 6', 'Unnamed: 8']
[Figure: stacked horizontal bar chart, “Top 15 Diseases by Age Group (M+F Combined)”]
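
The list printed above comes out in alphabetical order because the labels are sorted as strings, so '10-14y' precedes '5-9y'. A minimal sketch of ordering such labels by age instead is shown below. The helper `age_sort_key` and the day conversions are assumptions made for illustration; the conversions are approximate and serve only to rank the labels.

```python
import re

# Approximate unit lengths in days; used only to order labels, not for analysis.
UNIT_DAYS = {"d": 1, "m": 30, "y": 365}

def age_sort_key(label):
    """Return the lower bound of a label such as '1-11m' or '60+y', in approximate days."""
    match = re.match(r"(\d+)(?:-\d+|\+)?([dmy])$", label)
    if match is None:
        return float("inf")  # unrecognised labels (e.g. 'Unnamed: 2') sort last
    lower, unit = match.groups()
    return int(lower) * UNIT_DAYS[unit]

labels = ['0-29d', '1-11m', '1-4y', '10-14y', '15-19y', '20-24y',
          '25-49y', '5-9y', '50-59y', '60+y', 'Unnamed: 2']
ordered = sorted(labels, key=age_sort_key)
print(ordered)
```

Passing a key function to `sorted` leaves the labels themselves unchanged, so the same strings can still be used as column names and legend entries.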

Sankey Diagram

In [13]:
# ✅ 1. Load and parse the multi-header CSV correctly
import pandas as pd
import plotly.graph_objects as go

# Read raw CSV
lines = []
with open("datasets/DataSet_CommonDiseases.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Extract header rows
col1_row = lines[0].strip().split(",")
gender_row = lines[1].strip().split(",")

# Build clean column names: '0-29d_M', '0-29d_F', etc.
clean_cols = ["Disease"]
for i in range(1, len(col1_row), 2):  # skip every other (empty cols between age groups)
    age_raw = col1_row[i].strip()
    if not age_raw:
        continue
    # Normalize: "0-29 Days" → "0-29d", "60+ Years" → "60+y"
    age_clean = (
        age_raw.replace(" Days", "d")
               .replace(" Months", "m")
               .replace(" Years", "y")
               .replace("+", "plus")  # avoid special chars
    )
    # Append M and F
    clean_cols.append(f"{age_clean}_M")
    clean_cols.append(f"{age_clean}_F")

# Load data rows (skip first 2 header + any blank rows)
data_rows = []
for line in lines[2:]:
    cells = line.strip().split(",")
    if len(cells) < 5 or not cells[0].strip():
        continue
    # Pad/trim to match clean_cols length
    row = [cells[0]] + cells[1:2*10 + 1]  # 1 disease + 20 age-gender
    row = row[:len(clean_cols)]
    while len(row) < len(clean_cols):
        row.append("")
    data_rows.append(row)

# Create DataFrame
df = pd.DataFrame(data_rows, columns=clean_cols)

# Clean disease names & filter
df["Disease"] = df["Disease"].str.strip()
df = df[df["Disease"] != ""]
exclude = ["TOTAL", "PRIORITY", "Complications", "ANC,", "Foetal", "Neonatal", "Low Birth"]
mask = ~df["Disease"].str.contains("|".join(exclude), case=False, na=False)
df = df[mask].copy()

# Convert numeric columns
age_gender_cols = [c for c in clean_cols if c != "Disease"]
df[age_gender_cols] = df[age_gender_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

print("✅ Loaded", len(df), "diseases | Columns:", len(age_gender_cols))

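# ✅ 2. Build Sankey nodes & links
TOP_N = 6
# Rank on an explicit total: nlargest with a list of columns ranks by the
# first column ('0-29d_M') and uses the others only as tie-breakers.
df["Total"] = df[age_gender_cols].sum(axis=1)
top_diseases = df.nlargest(TOP_N, "Total")["Disease"].tolist()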

# Unique age groups and genders
age_groups = sorted({col.split("_")[0] for col in age_gender_cols if "_" in col})
genders = ["M", "F"]

# Nodes: Disease → AgeGroup → Gender
nodes = (
    top_diseases +
    [f"AG: {ag}" for ag in age_groups] +
    [f"Gender: {g}" for g in genders]
)

node_idx = {name: i for i, name in enumerate(nodes)}

links = []

for _, row in df.iterrows():
    disease = row["Disease"]
    if disease not in top_diseases:
        continue
    d_i = node_idx[disease]
    
    for col in age_gender_cols:
        if "_" not in col:
            continue
        ag, g = col.split("_")
        if g not in genders:
            continue
        value = row[col]
        if value <= 0:
            continue
            
        try:
            ag_node = f"AG: {ag}"
            g_node = f"Gender: {g}"
            links.append({
                "source": d_i,
                "target": node_idx[ag_node],
                "value": value  # full value, so flow into each age-group node equals flow out
            })
            links.append({
                "source": node_idx[ag_node],
                "target": node_idx[g_node],
                "value": value
            })
        except KeyError:
            continue  # skip if node missing

print("✅ Generated", len(links), "links")

# ✅ 3. Plot Sankey
fig = go.Figure(go.Sankey(
    node=dict(
        label=nodes,
        color=["#636EFA"] * len(top_diseases) + 
              ["#EF553B"] * len(age_groups) + 
              ["#00CC96", "#FF6692"],  # M, F
        pad=12,
        thickness=20
    ),
    link=dict(
        source=[l["source"] for l in links],
        target=[l["target"] for l in links],
        value=[l["value"] for l in links],
        color="rgba(128,128,128,0.2)"
    )
))

fig.update_layout(
    title_text=f"Sankey Diagram: Top {TOP_N} Diseases → Age Group → Gender",
    font_size=12,
    height=700
)

# ✅ 4. Show
fig.show()
✅ Loaded 115 diseases | Columns: 20
✅ Generated 204 links
[Figure: Sankey diagram, “Top 6 Diseases → Age Group → Gender”]
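
In a two-stage Sankey diagram of this kind, the total flowing into each age-group node should equal the total flowing out of it. The sketch below shows one way to build a single link per source and target pair with `groupby` and then check that balance. It assumes a long-format table; the column names, the helper `two_stage_links`, and all counts are made up for illustration.

```python
import pandas as pd

# Made-up counts in long format; column names are illustrative only.
long_df = pd.DataFrame({
    "Disease":  ["A", "A", "A", "B", "B"],
    "AgeGroup": ["1-4y", "1-4y", "60+y", "1-4y", "60+y"],
    "Gender":   ["M", "F", "M", "F", "F"],
    "Cases":    [10, 5, 2, 7, 3],
})

def two_stage_links(frame):
    """Return one aggregated link per (source, target) pair for both stages."""
    # Stage 1: disease to age group, summed over gender.
    stage1 = frame.groupby(["Disease", "AgeGroup"], as_index=False)["Cases"].sum()
    stage1.columns = ["source", "target", "value"]
    # Stage 2: age group to gender, summed over disease.
    stage2 = frame.groupby(["AgeGroup", "Gender"], as_index=False)["Cases"].sum()
    stage2.columns = ["source", "target", "value"]
    return pd.concat([stage1, stage2], ignore_index=True)

links_demo = two_stage_links(long_df)
inflow = links_demo.groupby("target")["value"].sum()
outflow = links_demo.groupby("source")["value"].sum()

# Every age-group node should receive exactly as much as it passes on.
for age_group in long_df["AgeGroup"].unique():
    assert inflow[age_group] == outflow[age_group]
print(links_demo)
```

The node labels in `source` and `target` would still need to be mapped to integer indices before being passed to a Sankey trace.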

3D Representation

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Reuse cleaned df from above (or reload)
df = pd.read_csv("datasets/DataSet_CommonDiseases.csv", header=[0, 1])
cols = []
for i, (top, bot) in enumerate(df.columns):
    if i == 0:
        cols.append('Disease')
    else:
        if pd.isna(top) or pd.isna(bot):
            cols.append(f'Unnamed_{i}')
        else:
            ag = top.strip().replace(' Days', 'd').replace(' Months', 'm').replace(' Years', 'y')
            cols.append(f"{ag}_{bot.strip()}")
df.columns = cols

df = df[df['Disease'].notna() & (df['Disease'].str.strip() != '')]
df = df[~df['Disease'].str.contains('TOTAL|PRIORITY|Complications', case=False)]
age_gender_cols = [c for c in cols if '_' in c]
df[age_gender_cols] = df[age_gender_cols].apply(pd.to_numeric, errors='coerce').fillna(0)

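# Aggregate M+F per age group.
# With header=[0, 1], the blank header cell above each 'F' column is read as
# 'Unnamed: N_level_0', so f"{ag}_F" never matches. This assumes each named
# '<age>_M' column is immediately followed by its 'F' column in the file.
age_groups = [c[:-2] for c in age_gender_cols
              if c.endswith('_M') and not c.startswith('Unnamed')]  # file order
for ag in age_groups:
    m_idx = df.columns.get_loc(f"{ag}_M")
    df[ag] = df.iloc[:, [m_idx, m_idx + 1]].sum(axis=1)

# Select top diseases by total cases (nlargest on a list of columns would rank
# by the first column and use the others only as tie-breakers)
TOP_N = 20
df['Total'] = df[age_groups].sum(axis=1)
top_df = df.nlargest(TOP_N, 'Total').copy()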

# Map age groups to x positions
age_to_x = {ag: i for i, ag in enumerate(age_groups)}
x = []
y = []  # disease index
z = []
disease_labels = []
colors = []

for idx, (_, row) in enumerate(top_df.iterrows()):
    disease = row['Disease']
    for ag in age_groups:
        cases = row[ag]
        if cases > 0:
            x.append(age_to_x[ag])
            y.append(idx)
            z.append(cases)
            disease_labels.append(disease)
            colors.append(idx)  # same color per disease

# 3D Plot
fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')

sc = ax.scatter(x, y, np.log10(np.array(z) + 1),  # log10 for scale
                c=colors, cmap='tab20', s=np.array(z)**0.5 * 5, alpha=0.8, edgecolor='k', linewidth=0.3)

ax.set_xticks(list(age_to_x.values()))
ax.set_xticklabels(list(age_to_x.keys()), rotation=30, ha='right')
ax.set_yticks(range(len(top_df)))
ax.set_yticklabels([name[:20] for name in top_df['Disease']], fontsize=9)
ax.set_zlabel('log₁₀(Cases + 1)')
ax.set_xlabel('Age Group')
ax.set_ylabel('Disease (Top 20)')

# Colorbar legend
cbar = plt.colorbar(sc, ax=ax, shrink=0.5, aspect=20, pad=0.1)
cbar.set_label('Disease Index')

ax.set_title('3D: Disease vs Age Group vs Case Count (log scale)', fontsize=14)
plt.subplots_adjust(left=0.35, right=0.75, top=0.92, bottom=0.08)
plt.show()
[Figure: 3D scatter plot, “3D: Disease vs Age Group vs Case Count (log scale)”]
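
When selecting the "top" diseases, it matters what `DataFrame.nlargest` is ranking on: given a list of columns, it orders rows by the first column and consults the remaining columns only to break ties. The sketch below contrasts that with ranking on an explicit total. The frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up counts: disease C dominates overall but has no cases in the first column.
demo = pd.DataFrame({
    "Disease": ["A", "B", "C"],
    "0-29d":   [5, 1, 0],
    "60+y":    [0, 50, 100],
})

# Ranking on a list of columns: ordered by '0-29d' first, '60+y' only breaks ties.
by_first_column = demo.nlargest(2, ["0-29d", "60+y"])["Disease"].tolist()

# Ranking on an explicit total across all age groups.
demo["Total"] = demo[["0-29d", "60+y"]].sum(axis=1)
by_total = demo.nlargest(2, "Total")["Disease"].tolist()

print(by_first_column)  # ['A', 'B']
print(by_total)         # ['C', 'B']
```

Ranking on a total column keeps the selection independent of the order in which the age-group columns happen to appear.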

Scatter Plot

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset (adjust path if needed)
# skiprows=2 drops the two header rows; header=None keeps the first data row as data
df = pd.read_csv("datasets/DataSet_CommonDiseases.csv", skiprows=2, header=None)

# Clean up column names: replace NaN/multi-index artifacts
df.columns = [
    'Disease',
    '0-29d_M', '0-29d_F',
    '1-11m_M', '1-11m_F',
    '1-4y_M', '1-4y_F',
    '5-9y_M', '5-9y_F',
    '10-14y_M', '10-14y_F',
    '15-19y_M', '15-19y_F',
    '20-24y_M', '20-24y_F',
    '25-49y_M', '25-49y_F',
    '50-59y_M', '50-59y_F',
    '60+y_M', '60+y_F'
]

# Ensure disease names are strings (blank cells are otherwise read as NaN)
df['Disease'] = df['Disease'].astype(str)

# Define age group labels (midpoints or categorical names)
age_groups = ['0-29d', '1-11m', '1-4y', '5-9y', '10-14y', '15-19y', '20-24y', '25-49y', '50-59y', '60+y']

# Melt the dataframe to long format for plotting
df_melted = df.melt(
    id_vars=['Disease'],
    value_vars=[col for col in df.columns if col != 'Disease'],
    var_name='AgeGender',
    value_name='Cases'
)

# Extract Age Group and Gender
df_melted['AgeGroup'] = df_melted['AgeGender'].str.split('_').str[0]
df_melted['Gender'] = df_melted['AgeGender'].str.split('_').str[1]
df_melted['Cases'] = pd.to_numeric(df_melted['Cases'], errors='coerce')  # handle non-numeric (e.g., empty/B)

# Drop rows with NaN cases
df_melted = df_melted.dropna(subset=['Cases'])

# Map age group to numeric position for x-axis
age_order = {grp: i for i, grp in enumerate(age_groups)}
df_melted['AgePos'] = df_melted['AgeGroup'].map(age_order)

# --------------------------
# 📊 SCATTER PLOT EXAMPLE: Compare Male vs Female cases per disease
# --------------------------

# Choose some example diseases to plot
diseases_of_interest = ['Diarrhoea', 'TuberculosisB', 'DiabetesB', 'Hypertension', 'Common Cold']

# Filter
filtered = df_melted[df_melted['Disease'].isin(diseases_of_interest)]

# Pivot to get Male and Female side-by-side
scatter_data = filtered.pivot_table(
    index=['Disease', 'AgeGroup', 'AgePos'],
    columns='Gender',
    values='Cases',
    fill_value=0
).reset_index()

# Plot
plt.figure(figsize=(12, 8))

colors = plt.cm.tab10(np.linspace(0, 1, len(diseases_of_interest)))

for i, disease in enumerate(diseases_of_interest):
    subset = scatter_data[scatter_data['Disease'] == disease]
    if 'M' in subset.columns and 'F' in subset.columns:
        plt.scatter(
            subset['M'],
            subset['F'],
            label=disease,
            s=60,
            alpha=0.7,
            color=colors[i],
            edgecolors='k',
            linewidth=0.5
        )
        # Optional: annotate age groups
        for _, row in subset.iterrows():
            plt.text(row['M'] + 1, row['F'] + 1, row['AgeGroup'], fontsize=7, color=colors[i])

plt.xlabel('Male Cases', fontsize=12)
plt.ylabel('Female Cases', fontsize=12)
plt.title('Scatter Plot: Male vs Female Cases by Disease & Age Group', fontsize=14)
plt.legend(title='Disease')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
[Figure: scatter plot, “Male vs Female Cases by Disease & Age Group”]
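
In a male-versus-female scatter plot, points above the diagonal have more female than male cases. One way to put a number on that distance is the female share of cases for each disease and age group. The sketch below assumes a table with one 'M' and one 'F' column per row, similar in shape to the pivoted data; the values and the column name `FemaleShare` are made up for illustration.

```python
import pandas as pd

# Made-up counts, one row per disease and age group.
pairs = pd.DataFrame({
    "Disease":  ["X", "X", "Y"],
    "AgeGroup": ["1-4y", "60+y", "25-49y"],
    "M":        [40, 10, 0],
    "F":        [60, 30, 0],
})

total = pairs["M"] + pairs["F"]
# Share of female cases; left as NaN where there are no cases at all.
pairs["FemaleShare"] = (pairs["F"] / total).where(total > 0)
print(pairs)
```

A share near 0.5 corresponds to a point close to the diagonal, while values near 0 or 1 mark rows dominated by one gender.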