[Sonam-Phuntsho] - Fab Futures - Data Science
Home About

Tried my first attempt to visualize dataset by watching youtube and help from Ai(deepseek) to get code.May be later i would able to put some of my own codes, right now i am able to undersatnd just few basic of coading. I tried to work with excel file but rememberd something about using csv file. Therefore converted xlsx file to csv file. The code given below is about csv file on health personel density. Prompt asked in deepseek

  1. How is the data analysed using jupyter notebook?
  2. Provide step wise data visualisation of csv file in simple terms.
  3. Also explain codes.

All the code used below for visualisation is from deepseek. But i felt it is utmost important to understand the code which i lack at the moment. I will burn mid-night candle to move forward to use my own code.

No description has been provided for this image

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [4]:
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [10]:
df = pd.read_csv("datasets/ahb_2023.csv", skiprows=1)
In [11]:
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
df.head()
Dataset Shape: (11, 9)

First few rows:
Out[11]:
sl.no Indicators 2017 2018 2019 2020 2021 2022 Source
0 1.0 Number of Doctors and density (per 10,000 popu... 345\n [4.3] 337\n [4.6] 318\n [4.32] 336\n[4.62] 354\n[4.64] 354\n[4.64] HRM, MoH
1 2.0 Number of Nurses and density (per 10,000 popul... 1264 [16.2] 1202 [16.5] 1364 [18.6] 1517\n[20.9] 1608\n[21.07] 1505\n[19.71] HRM, MoH
2 3.0 Number of Pharmacists and density (per 10,000 ... 36\n [0.5] 44\n [0.6] 43 \n[0.6] 42\n[0.6] 46\n[0.6] 45\n[0.6] HRM, MoH
3 4.0 Number HA,CO and BHW [density (per 10,000 popu... 636\n [8.1] 604 \n[8.3] 620 \n[8.4] 650\n[8.9] 683\n[8.95] 672\n[8.80] HRM, MoH
4 5.0 Number of Drungtshos (Indigenous Physicians) a... 55\n [0.7] 53\n [0.7] 54 \n[0.7] 52\n[0.7] 59\n[0.77] 79\n[1.03] HRM, MoH
In [12]:
df_clean = df.iloc[:7].copy()
df_clean.columns = ['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source']
df_clean.reset_index(drop=True, inplace=True)
print("Cleaned Data Shape:", df_clean.shape)
df_clean
Cleaned Data Shape: (7, 9)
Out[12]:
sl_no indicators 2017 2018 2019 2020 2021 2022 source
0 1.0 Number of Doctors and density (per 10,000 popu... 345\n [4.3] 337\n [4.6] 318\n [4.32] 336\n[4.62] 354\n[4.64] 354\n[4.64] HRM, MoH
1 2.0 Number of Nurses and density (per 10,000 popul... 1264 [16.2] 1202 [16.5] 1364 [18.6] 1517\n[20.9] 1608\n[21.07] 1505\n[19.71] HRM, MoH
2 3.0 Number of Pharmacists and density (per 10,000 ... 36\n [0.5] 44\n [0.6] 43 \n[0.6] 42\n[0.6] 46\n[0.6] 45\n[0.6] HRM, MoH
3 4.0 Number HA,CO and BHW [density (per 10,000 popu... 636\n [8.1] 604 \n[8.3] 620 \n[8.4] 650\n[8.9] 683\n[8.95] 672\n[8.80] HRM, MoH
4 5.0 Number of Drungtshos (Indigenous Physicians) a... 55\n [0.7] 53\n [0.7] 54 \n[0.7] 52\n[0.7] 59\n[0.77] 79\n[1.03] HRM, MoH
5 6.0 Number of sMenpas (Sowa Menpas) and density (p... 113 \n[1.4] 113 \n[1.5] 116 \n[1.6] 137\n[1.9] 146\n[1.9] 175\n[2.29] HRM, MoH
6 7.0 Number and distribution of health facilities (... 276\n [3.6] 279 \n[3.8] 288\n [3.9] 289\n[4.0] 289\n[3.85] 291\n[3.81] HMIS, MoH
In [13]:
def extract_values(text):
    if pd.isna(text):
        return np.nan, np.nan
    text = str(text)
    # Extract numbers and density values
    numbers = text.split('[')
    count = numbers[0].strip()
    density = numbers[1].replace(']', '').strip() if len(numbers) > 1 else np.nan
    return count, density
years = ['2017', '2018', '2019', '2020', '2021', '2022']
for year in years:
    # Extract count and density for each year
    df_clean[[f'{year}_count', f'{year}_density']] = df_clean[year].apply(
        lambda x: pd.Series(extract_values(x))
    )
    df_clean[f'{year}_count'] = pd.to_numeric(df_clean[f'{year}_count'], errors='coerce')
    df_clean[f'{year}_density'] = pd.to_numeric(df_clean[f'{year}_density'], errors='coerce')
print("Transformed Data Columns:")
print(df_clean.columns.tolist())
df_clean[['indicators'] + [f'{year}_count' for year in years] + [f'{year}_density' for year in years]].head()
Transformed Data Columns:
['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source', '2017_count', '2017_density', '2018_count', '2018_density', '2019_count', '2019_density', '2020_count', '2020_density', '2021_count', '2021_density', '2022_count', '2022_density']
Out[13]:
indicators 2017_count 2018_count 2019_count 2020_count 2021_count 2022_count 2017_density 2018_density 2019_density 2020_density 2021_density 2022_density
0 Number of Doctors and density (per 10,000 popu... 345 337 318 336 354 354 4.3 4.6 4.32 4.62 4.64 4.64
1 Number of Nurses and density (per 10,000 popul... 1264 1202 1364 1517 1608 1505 16.2 16.5 18.60 20.90 21.07 19.71
2 Number of Pharmacists and density (per 10,000 ... 36 44 43 42 46 45 0.5 0.6 0.60 0.60 0.60 0.60
3 Number HA,CO and BHW [density (per 10,000 popu... 636 604 620 650 683 672 8.1 8.3 8.40 8.90 8.95 8.80
4 Number of Drungtshos (Indigenous Physicians) a... 55 53 54 52 59 79 0.7 0.7 0.70 0.70 0.77 1.03
In [14]:
plt.figure(figsize=(14, 8))

indicators = df_clean['indicators'].tolist()
colors = plt.cm.Set3(np.linspace(0, 1, len(indicators)))

for i, indicator in enumerate(indicators):
    densities = [df_clean.loc[i, f'{year}_density'] for year in years]
    plt.plot(years, densities, marker='o', linewidth=2.5, markersize=6, 
             label=indicator.split('(')[0][:30] + '...', color=colors[i])

plt.title('Healthcare Workforce Density Trends (2017-2022)\nPer 10,000 Population', 
          fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Density (per 10,000 population)', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [15]:
plt.figure(figsize=(12, 8))
densities_2022 = [df_clean.loc[i, '2022_density'] for i in range(len(df_clean))]
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]

bars = plt.barh(short_labels, densities_2022, color=plt.cm.viridis(np.linspace(0, 1, len(short_labels))))
plt.title('Healthcare Workforce Density in 2022\n(Per 10,000 Population)', 
          fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Density (per 10,000 population)', fontsize=12)
for bar in bars:
    width = bar.get_width()
    plt.text(width + 0.1, bar.get_y() + bar.get_height()/2, 
             f'{width:.2f}', ha='left', va='center', fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [16]:
plt.figure(figsize=(14, 8))
workforce_data = []
for i in range(len(df_clean)):
    counts = [df_clean.loc[i, f'{year}_count'] for year in years]
    workforce_data.append(counts)

workforce_data = np.array(workforce_data)
short_labels = [indicator.split('(')[0].strip()[:20] for indicator in df_clean['indicators']]

plt.stackplot(years, workforce_data, labels=short_labels, alpha=0.8)
plt.title('Composition of Healthcare Workforce Over Time', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Healthcare Workers', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [26]:
plt.figure(figsize=(12, 8))
density_matrix = []
for i in range(len(df_clean)):
    row = [df_clean.loc[i, f'{year}_density'] for year in years]
    density_matrix.append(row)

density_matrix = np.array(density_matrix)
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]

sns.heatmap(density_matrix, 
            xticklabels=years, 
            yticklabels=short_labels,
            annot=True, 
            fmt='.2f', 
            cmap='YlOrRd',
            cbar_kws={'label': 'Density (per 10,000)'})

plt.title('Healthcare Workforce Density Heatmap (2017-2022)', 
          fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Healthcare Workforce Category', fontsize=12)
plt.tight_layout()
plt.show()
No description has been provided for this image

Findings from heat map(Key Findings & Trends)¶

  1. Nursing Workforce (Most Significant) Strong growth: 16.20 (2017) → 19.71 (2022) - 21.7% increase.Peak in 2021 at 21.07, then slight decline in 2022. Nurses are the largest healthcare workforce group.

  2. Doctor Density - Stagnant Minimal growth: 4.30 (2017) → 4.64 (2022) - 7.9% increase onlyConsistently low throughout 6-year period. Critical shortage area - less than 5 doctors per 10,000 population.

  3. Pharmacists - No Growth Completely stagnant at 0.60 from 2018-2022.Lowest density among all healthcare professions

  4. Traditional Medicine Practitioners Drungshos: 0.70 → 1.03 (47% increase). sMempas: 1.40 → 2.29 (64% increase) Significant investment in traditional medicine workforce.Cultural integration in healthcare system

  5. Health Assistants/Community Workers Steady growth: 8.10 → 8.80 (8.6% increase).Stable backbone of community healthcare

Ratios (2022 data)¶

  1. nurse_to_doctor_ratio = 19.71 / 4.64 4.25 nurses per doctor
  2. community_health_ratio = 8.80 / 4.64 1.9 community workers per doctor
In [20]:
plt.figure(figsize=(12, 8))
percentage_changes = []
for i in range(len(df_clean)):
    density_2017 = df_clean.loc[i, '2017_density']
    density_2022 = df_clean.loc[i, '2022_density']
    if density_2017 > 0:
        pct_change = ((density_2022 - density_2017) / density_2017) * 100
    else:
        pct_change = 0
    percentage_changes.append(pct_change)
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]
colors = ['green' if x >= 0 else 'red' for x in percentage_changes]
bars = plt.barh(short_labels, percentage_changes, color=colors, alpha=0.7)
plt.title('Percentage Change in Workforce Density (2022 vs 2017)', 
          fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Percentage Change (%)', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
for bar in bars:
    width = bar.get_width()
    plt.text(width + (1 if width >= 0 else -5), bar.get_y() + bar.get_height()/2, 
             f'{width:.1f}%', ha='left' if width >= 0 else 'right', 
             va='center', fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 

Code Explanation for me to understand¶

  1. sns.set_palette("husl")

Sets the color palette for seaborn plots to "HUSL" (Hue-Saturation-Lightness) HUSL palette provides evenly spaced, perceptually uniform colors Good for categorical data with multiple distinct categories

  1. plt.style.use('seaborn-v0_8')

Applies the seaborn styling theme to matplotlib plots The v0_8 specifies compatibility with seaborn version 0.8+ This affects background colors, grid lines, font sizes, etc.

  1. %matplotlib inline

Jupyter/IPython magic command Displays plots directly in the notebook output cells Plots are rendered as static images below the code cells

  1. %config InlineBackend.figure_format = 'retina'

Configures matplotlib to use high-resolution ("retina") display Produces sharper, clearer plots with higher DPI Particularly beneficial for high-resolution displays

  1. The skiprows=1 parameter is very useful for cleaning up data files where the first row isn't part of the actual data structure but contains file metadata, descriptions, or comments.

  2. print("Dataset Shape:", df.shape) df.shape: Returns a tuple showing the dimensions of the DataFrame

Output format: (number_of_rows, number_of_columns)

This gives you a quick overview of how big your dataset is

  1. print("\nFirst few rows:") df.head() df.head(): Displays the first 5 rows of the DataFrame by default

\n: Creates a newline for better readability in the output

This shows you a sample of the actual data.

Why use these?¶

  1. df.shape - Quick size check to understand data volume

  2. df.head() - Visual inspection of data structure and values

  3. Combined - Get both overview and detailed look at the data

  4. df.head() # Shows first 5 rows

  5. head(10) # Shows first 10 rows

  6. df.sample(5) # Shows 5 random rows

  7. df.info() # Comprehensive info about the DataFrame

df_clean = df.iloc[:7].copy() df.iloc[:7]: Selects the first 7 rows (index 0-6) using integer-based indexing .copy(): Creates a deep copy to avoid modifying the original DataFrame Purpose: Extracts only the relevant rows needed for analysis

df_clean.columns = ['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source'] Renames all columns to more meaningful names Column structure: sl_no: Serial number indicators: Description of what each row represents 2017-2022: Year columns with data values source: Data source/reference

df_clean.reset_index(drop=True, inplace=True) reset_index(): Resets the row index to 0, 1, 2, 3... drop=True: Discards the old index (doesn't keep it as a column) inplace=True: Modifies the DataFrame directly instead of returning a new one

print("Cleaned Data Shape:", df_clean.shape) Shows the new dimensions after cleaning Should output: (7, 9) - 7 rows and 9 columns df_clean Displays the cleaned DataFrame in Jupyter notebook

Why each step is important: iloc[:7]: Filters to only relevant data rows .copy(): Prevents unexpected changes to original data Column renaming: Makes data more readable and analysis-friendly reset_index(): Creates clean, sequential indexing after subsetting Shape check: Verifies the cleaning operation worked as expected