Tried my first attempt to visualize dataset by watching youtube and help from Ai(deepseek) to get code.May be later i would able to put some of my own codes, right now i am able to undersatnd just few basic of coading. I tried to work with excel file but rememberd something about using csv file. Therefore converted xlsx file to csv file. The code given below is about csv file on health personel density. Prompt asked in deepseek
- How is the data analysed using jupyter notebook?
- Provide step wise data visualisation of csv file in simple terms.
- Also explain codes.
All the code used below for visualisation is from deepseek. But i felt it is utmost important to understand the code which i lack at the moment. I will burn mid-night candle to move forward to use my own code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
df = pd.read_csv("datasets/ahb_2023.csv", skiprows=1)
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
df.head()
Dataset Shape: (11, 9) First few rows:
| sl.no | Indicators | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | Source | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Number of Doctors and density (per 10,000 popu... | 345\n [4.3] | 337\n [4.6] | 318\n [4.32] | 336\n[4.62] | 354\n[4.64] | 354\n[4.64] | HRM, MoH |
| 1 | 2.0 | Number of Nurses and density (per 10,000 popul... | 1264 [16.2] | 1202 [16.5] | 1364 [18.6] | 1517\n[20.9] | 1608\n[21.07] | 1505\n[19.71] | HRM, MoH |
| 2 | 3.0 | Number of Pharmacists and density (per 10,000 ... | 36\n [0.5] | 44\n [0.6] | 43 \n[0.6] | 42\n[0.6] | 46\n[0.6] | 45\n[0.6] | HRM, MoH |
| 3 | 4.0 | Number HA,CO and BHW [density (per 10,000 popu... | 636\n [8.1] | 604 \n[8.3] | 620 \n[8.4] | 650\n[8.9] | 683\n[8.95] | 672\n[8.80] | HRM, MoH |
| 4 | 5.0 | Number of Drungtshos (Indigenous Physicians) a... | 55\n [0.7] | 53\n [0.7] | 54 \n[0.7] | 52\n[0.7] | 59\n[0.77] | 79\n[1.03] | HRM, MoH |
df_clean = df.iloc[:7].copy()
df_clean.columns = ['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source']
df_clean.reset_index(drop=True, inplace=True)
print("Cleaned Data Shape:", df_clean.shape)
df_clean
Cleaned Data Shape: (7, 9)
| sl_no | indicators | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | source | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Number of Doctors and density (per 10,000 popu... | 345\n [4.3] | 337\n [4.6] | 318\n [4.32] | 336\n[4.62] | 354\n[4.64] | 354\n[4.64] | HRM, MoH |
| 1 | 2.0 | Number of Nurses and density (per 10,000 popul... | 1264 [16.2] | 1202 [16.5] | 1364 [18.6] | 1517\n[20.9] | 1608\n[21.07] | 1505\n[19.71] | HRM, MoH |
| 2 | 3.0 | Number of Pharmacists and density (per 10,000 ... | 36\n [0.5] | 44\n [0.6] | 43 \n[0.6] | 42\n[0.6] | 46\n[0.6] | 45\n[0.6] | HRM, MoH |
| 3 | 4.0 | Number HA,CO and BHW [density (per 10,000 popu... | 636\n [8.1] | 604 \n[8.3] | 620 \n[8.4] | 650\n[8.9] | 683\n[8.95] | 672\n[8.80] | HRM, MoH |
| 4 | 5.0 | Number of Drungtshos (Indigenous Physicians) a... | 55\n [0.7] | 53\n [0.7] | 54 \n[0.7] | 52\n[0.7] | 59\n[0.77] | 79\n[1.03] | HRM, MoH |
| 5 | 6.0 | Number of sMenpas (Sowa Menpas) and density (p... | 113 \n[1.4] | 113 \n[1.5] | 116 \n[1.6] | 137\n[1.9] | 146\n[1.9] | 175\n[2.29] | HRM, MoH |
| 6 | 7.0 | Number and distribution of health facilities (... | 276\n [3.6] | 279 \n[3.8] | 288\n [3.9] | 289\n[4.0] | 289\n[3.85] | 291\n[3.81] | HMIS, MoH |
def extract_values(text):
if pd.isna(text):
return np.nan, np.nan
text = str(text)
# Extract numbers and density values
numbers = text.split('[')
count = numbers[0].strip()
density = numbers[1].replace(']', '').strip() if len(numbers) > 1 else np.nan
return count, density
years = ['2017', '2018', '2019', '2020', '2021', '2022']
for year in years:
# Extract count and density for each year
df_clean[[f'{year}_count', f'{year}_density']] = df_clean[year].apply(
lambda x: pd.Series(extract_values(x))
)
df_clean[f'{year}_count'] = pd.to_numeric(df_clean[f'{year}_count'], errors='coerce')
df_clean[f'{year}_density'] = pd.to_numeric(df_clean[f'{year}_density'], errors='coerce')
print("Transformed Data Columns:")
print(df_clean.columns.tolist())
df_clean[['indicators'] + [f'{year}_count' for year in years] + [f'{year}_density' for year in years]].head()
Transformed Data Columns: ['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source', '2017_count', '2017_density', '2018_count', '2018_density', '2019_count', '2019_density', '2020_count', '2020_density', '2021_count', '2021_density', '2022_count', '2022_density']
| indicators | 2017_count | 2018_count | 2019_count | 2020_count | 2021_count | 2022_count | 2017_density | 2018_density | 2019_density | 2020_density | 2021_density | 2022_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Number of Doctors and density (per 10,000 popu... | 345 | 337 | 318 | 336 | 354 | 354 | 4.3 | 4.6 | 4.32 | 4.62 | 4.64 | 4.64 |
| 1 | Number of Nurses and density (per 10,000 popul... | 1264 | 1202 | 1364 | 1517 | 1608 | 1505 | 16.2 | 16.5 | 18.60 | 20.90 | 21.07 | 19.71 |
| 2 | Number of Pharmacists and density (per 10,000 ... | 36 | 44 | 43 | 42 | 46 | 45 | 0.5 | 0.6 | 0.60 | 0.60 | 0.60 | 0.60 |
| 3 | Number HA,CO and BHW [density (per 10,000 popu... | 636 | 604 | 620 | 650 | 683 | 672 | 8.1 | 8.3 | 8.40 | 8.90 | 8.95 | 8.80 |
| 4 | Number of Drungtshos (Indigenous Physicians) a... | 55 | 53 | 54 | 52 | 59 | 79 | 0.7 | 0.7 | 0.70 | 0.70 | 0.77 | 1.03 |
plt.figure(figsize=(14, 8))
indicators = df_clean['indicators'].tolist()
colors = plt.cm.Set3(np.linspace(0, 1, len(indicators)))
for i, indicator in enumerate(indicators):
densities = [df_clean.loc[i, f'{year}_density'] for year in years]
plt.plot(years, densities, marker='o', linewidth=2.5, markersize=6,
label=indicator.split('(')[0][:30] + '...', color=colors[i])
plt.title('Healthcare Workforce Density Trends (2017-2022)\nPer 10,000 Population',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Density (per 10,000 population)', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 8))
densities_2022 = [df_clean.loc[i, '2022_density'] for i in range(len(df_clean))]
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]
bars = plt.barh(short_labels, densities_2022, color=plt.cm.viridis(np.linspace(0, 1, len(short_labels))))
plt.title('Healthcare Workforce Density in 2022\n(Per 10,000 Population)',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Density (per 10,000 population)', fontsize=12)
for bar in bars:
width = bar.get_width()
plt.text(width + 0.1, bar.get_y() + bar.get_height()/2,
f'{width:.2f}', ha='left', va='center', fontweight='bold')
plt.tight_layout()
plt.show()
plt.figure(figsize=(14, 8))
workforce_data = []
for i in range(len(df_clean)):
counts = [df_clean.loc[i, f'{year}_count'] for year in years]
workforce_data.append(counts)
workforce_data = np.array(workforce_data)
short_labels = [indicator.split('(')[0].strip()[:20] for indicator in df_clean['indicators']]
plt.stackplot(years, workforce_data, labels=short_labels, alpha=0.8)
plt.title('Composition of Healthcare Workforce Over Time', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Healthcare Workers', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 8))
density_matrix = []
for i in range(len(df_clean)):
row = [df_clean.loc[i, f'{year}_density'] for year in years]
density_matrix.append(row)
density_matrix = np.array(density_matrix)
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]
sns.heatmap(density_matrix,
xticklabels=years,
yticklabels=short_labels,
annot=True,
fmt='.2f',
cmap='YlOrRd',
cbar_kws={'label': 'Density (per 10,000)'})
plt.title('Healthcare Workforce Density Heatmap (2017-2022)',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Healthcare Workforce Category', fontsize=12)
plt.tight_layout()
plt.show()
Findings from heat map(Key Findings & Trends)¶
Nursing Workforce (Most Significant) Strong growth: 16.20 (2017) → 19.71 (2022) - 21.7% increase.Peak in 2021 at 21.07, then slight decline in 2022. Nurses are the largest healthcare workforce group.
Doctor Density - Stagnant Minimal growth: 4.30 (2017) → 4.64 (2022) - 7.9% increase onlyConsistently low throughout 6-year period. Critical shortage area - less than 5 doctors per 10,000 population.
Pharmacists - No Growth Completely stagnant at 0.60 from 2018-2022.Lowest density among all healthcare professions
Traditional Medicine Practitioners Drungshos: 0.70 → 1.03 (47% increase). sMempas: 1.40 → 2.29 (64% increase) Significant investment in traditional medicine workforce.Cultural integration in healthcare system
Health Assistants/Community Workers Steady growth: 8.10 → 8.80 (8.6% increase).Stable backbone of community healthcare
Ratios (2022 data)¶
- nurse_to_doctor_ratio = 19.71 / 4.64 4.25 nurses per doctor
- community_health_ratio = 8.80 / 4.64 1.9 community workers per doctor
plt.figure(figsize=(12, 8))
percentage_changes = []
for i in range(len(df_clean)):
density_2017 = df_clean.loc[i, '2017_density']
density_2022 = df_clean.loc[i, '2022_density']
if density_2017 > 0:
pct_change = ((density_2022 - density_2017) / density_2017) * 100
else:
pct_change = 0
percentage_changes.append(pct_change)
short_labels = [indicator.split('(')[0].strip()[:25] for indicator in df_clean['indicators']]
colors = ['green' if x >= 0 else 'red' for x in percentage_changes]
bars = plt.barh(short_labels, percentage_changes, color=colors, alpha=0.7)
plt.title('Percentage Change in Workforce Density (2022 vs 2017)',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Percentage Change (%)', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
for bar in bars:
width = bar.get_width()
plt.text(width + (1 if width >= 0 else -5), bar.get_y() + bar.get_height()/2,
f'{width:.1f}%', ha='left' if width >= 0 else 'right',
va='center', fontweight='bold')
plt.tight_layout()
plt.show()
Code Explanation for me to understand¶
- sns.set_palette("husl")
Sets the color palette for seaborn plots to "HUSL" (Hue-Saturation-Lightness) HUSL palette provides evenly spaced, perceptually uniform colors Good for categorical data with multiple distinct categories
- plt.style.use('seaborn-v0_8')
Applies the seaborn styling theme to matplotlib plots The v0_8 specifies compatibility with seaborn version 0.8+ This affects background colors, grid lines, font sizes, etc.
- %matplotlib inline
Jupyter/IPython magic command Displays plots directly in the notebook output cells Plots are rendered as static images below the code cells
- %config InlineBackend.figure_format = 'retina'
Configures matplotlib to use high-resolution ("retina") display Produces sharper, clearer plots with higher DPI Particularly beneficial for high-resolution displays
The skiprows=1 parameter is very useful for cleaning up data files where the first row isn't part of the actual data structure but contains file metadata, descriptions, or comments.
print("Dataset Shape:", df.shape) df.shape: Returns a tuple showing the dimensions of the DataFrame
Output format: (number_of_rows, number_of_columns)
This gives you a quick overview of how big your dataset is
- print("\nFirst few rows:") df.head() df.head(): Displays the first 5 rows of the DataFrame by default
\n: Creates a newline for better readability in the output
This shows you a sample of the actual data.
Why use these?¶
df.shape - Quick size check to understand data volume
df.head() - Visual inspection of data structure and values
Combined - Get both overview and detailed look at the data
df.head() # Shows first 5 rows
head(10) # Shows first 10 rows
df.sample(5) # Shows 5 random rows
df.info() # Comprehensive info about the DataFrame
df_clean = df.iloc[:7].copy() df.iloc[:7]: Selects the first 7 rows (index 0-6) using integer-based indexing .copy(): Creates a deep copy to avoid modifying the original DataFrame Purpose: Extracts only the relevant rows needed for analysis
df_clean.columns = ['sl_no', 'indicators', '2017', '2018', '2019', '2020', '2021', '2022', 'source'] Renames all columns to more meaningful names Column structure: sl_no: Serial number indicators: Description of what each row represents 2017-2022: Year columns with data values source: Data source/reference
df_clean.reset_index(drop=True, inplace=True) reset_index(): Resets the row index to 0, 1, 2, 3... drop=True: Discards the old index (doesn't keep it as a column) inplace=True: Modifies the DataFrame directly instead of returning a new one
print("Cleaned Data Shape:", df_clean.shape) Shows the new dimensions after cleaning Should output: (7, 9) - 7 rows and 9 columns df_clean Displays the cleaned DataFrame in Jupyter notebook
Why each step is important: iloc[:7]: Filters to only relevant data rows .copy(): Prevents unexpected changes to original data Column renaming: Makes data more readable and analysis-friendly reset_index(): Creates clean, sequential indexing after subsetting Shape check: Verifies the cleaning operation worked as expected