Analysing the Annual Health Statistics of Bhutan¶

To start with, I had a lot of difficulty even getting my data into this notebook, which I realised was partly due to how the original data was organised in the CSV file. To address this, I've cut down a lot on the data I was planning to analyse.

Now, my data analysis will be on the table "Mortality cases by diseases for last 5 years, Bhutan", extracted from the same dataset linked in Week 1.

However, I still faced lots of difficulties when it came to the data analysis process (which is still ongoing) since the export from xlsx to csv doesn't seem to have been very smooth...

Part of the process has been documented here (previous draft can be found here), but the final analysis is only at the very end. Repeating the process all over again was great for learning, but took a very long time.

The final data representation, which I was semi-satisfied with, is a treemap.

Obtaining the data¶

The following are the ChatGPT prompts used in this first part:

I need help in uploading and accessing a .csv file on Jupyter Notebooks. Please advise on how to do so.
What is pandas?
Where can I find the path to my data set?
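On the last prompt: a relative path such as "datasets/Mortality cases3.csv" is looked up starting from the notebook's working directory. The following is a minimal sketch, using only the standard library, of one way to check where Python is looking; the variable name `csv_path` is just for illustration.

```python
from pathlib import Path

# Relative paths are resolved against the current working directory,
# which for a notebook is normally the folder the notebook lives in.
csv_path = Path("datasets") / "Mortality cases3.csv"

print("Working directory:", Path.cwd())
print("Looking for:", csv_path.resolve())
print("File exists:", csv_path.exists())
```

If the last line prints False, the file is probably in a different folder from the one the path points to.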

In [1]:

import pandas as pd

df = pd.read_csv("datasets/Mortality cases3.csv")
df.head(5)

Out[1]:

	ICD10 CODE	Name of the Disease	Type of disease	2018	2019	2020	2021	2022
0	A02ᴳ	Diarrhoea	Infectious	6.0	6.0	2.0	NaN	2.0
1	A03ᴳ	Dysentery	Infectious	NaN	NaN	NaN	NaN	NaN
2	A15ᴳ	Tuberculosis	Infectious	22.0	20.0	20.0	31.0	17.0
3	A41ᴳ	Other Sepsis, including Septicaemia	Infectious	62.0	46.0	52.0	32.0	45.0
4	A50	Congenital Syphilis	Infectious	NaN	NaN	NaN	NaN	NaN

In [2]:

df.columns

Out[2]:

Index(['ICD10 CODE', 'Name of the Disease', 'Type of disease', '2018', '2019',
       '2020', '2021', '2022'],
      dtype='object')

In [3]:

years = ['2018', '2019', '2020', '2021', '2022']
df[years] = df[years].apply(pd.to_numeric, errors='coerce') # the "pd.to_numeric" part ensures that data is read as numbers
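To see what `errors='coerce'` does in the cell above, here is a small sketch with made-up values (not taken from the Bhutan dataset): anything that cannot be read as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Made-up values, only to show the behaviour (not from the Bhutan dataset)
raw = pd.Series(["6", "22", "-", None])

# errors="coerce" converts what it can and replaces the rest with NaN
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())
print(clean.dtype)
```

This is likely also why the counts show up as 6.0 rather than 6: a column that contains NaN is stored as floating-point numbers.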

As an afterthought once I started my data analysis, I felt it was important to list the variables:

ICD10 CODE
Name of the Disease
Type of disease
2018
2019
2020
2021
2022

Data Analysis¶

Now (for the second time) I'm thinking my data is finally ready to be analysed, so I asked ChatGPT for help with creating an interactive visualisation on Plotly that would allow someone to select a disease and see information on it accordingly, with the context that I'd like to use it with students.

Some of the code below was taken from ChatGPT/Gemini (as referenced). I also used https://plotly.com/python/plotly-express/ as a reference.

In [4]:

import plotly.express as px

In [5]:

import plotly.io as pio # to enable plotly rendering
pio.renderers.default = "notebook"   # or notebook_connected - this is from chatGPT

I don't really understand the part below, but I wasn't able to create any graphics - both AI platforms noted that I had to "Make the data 'long format'" if I wanted interactive visuals, so ChatGPT provided the following code:

(this is also the part I really struggled with in my abandoned week02 notebook)

In [6]:

df_long = df.melt(
    id_vars=['ICD10 CODE', 'Name of the Disease', 'Type of disease'],
    value_vars=years,
    var_name='Year',
    value_name='DeathCount'
)

df_long.head()

Out[6]:

	ICD10 CODE	Name of the Disease	Type of disease	Year	DeathCount
0	A02ᴳ	Diarrhoea	Infectious	2018	6.0
1	A03ᴳ	Dysentery	Infectious	2018	NaN
2	A15ᴳ	Tuberculosis	Infectious	2018	22.0
3	A41ᴳ	Other Sepsis, including Septicaemia	Infectious	2018	62.0
4	A50	Congenital Syphilis	Infectious	2018	NaN
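For anyone else puzzled by "long format", here is a toy sketch of what `melt` does. The two diseases and their numbers are made up; only the column names mirror the real table.

```python
import pandas as pd

# Toy "wide" table: one row per disease, one column per year (made-up numbers)
wide = pd.DataFrame({
    "Name of the Disease": ["Disease A", "Disease B"],
    "2021": [5, 7],
    "2022": [3, 9],
})

# melt keeps the id_vars as they are and stacks the year columns
# into two new columns: one for the year, one for the value
long = wide.melt(
    id_vars=["Name of the Disease"],
    value_vars=["2021", "2022"],
    var_name="Year",
    value_name="DeathCount",
)

print(long)
```

Two rows with two year columns become four rows with a single Year column. Plotly Express expects one column per visual role (x, y, colour, animation frame), which is why the year has to be a column of its own.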

Assuming I have successfully "melted the DataFrame", I experimented with code from the Plotly website, with additional help from ChatGPT. For example, one of my prompts was:

I want to use the following code for my analysis. df = px.data.iris() fig = px.scatter(df, x="Year", y="DeathCount", color="????") fig.show() What is the "color" part?

This helped me gain a better sense of the different visualisation options.

Attempt 1:¶

In [7]:

# Plot df_long directly; reassigning df would overwrite the original wide table
fig = px.scatter(df_long, x="Year", y="DeathCount", color="Name of the Disease")
fig.show()

Sort of interesting, but doesn't seem to tell me much ...

Attempt 2:¶

In [8]:

fig = px.line(
    df_long,
    x="Year",
    y="DeathCount",
    color="Name of the Disease",
    markers=True
)

fig.show()

Same thing, just connected by a line?

Attempt 3:¶

Got the following code from AI to understand how to use Python with Plotly to construct a Sankey diagram:

import plotly.graph_objects as go

flow_df = df.groupby(['parental_education_level', 'family_income_range', 'dropout_risk']).size().reset_index(name='count')

fig = go.Figure(data=[go.Sankey(
    # ... node and link definitions ...
)])
fig.show()

But for now, upon further reflection, a Sankey also doesn't feel like the best type of representation for the data I have. It could show the changes from year to year, but I felt like it would create something quite messy...

Attempt 4:¶

Trying out animations

To figure out the ranges and size, I just looked at my data, but this felt like a rather inefficient way to do it. Maybe better to find out how to code an automatic range finder?

In [9]:

df_long['DeathCount'].max() # according to ChatGPT, this finds the largest single value in the column (across all diseases and years, not for one given year)

Out[9]:

166.0
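One possible "automatic range finder" is sketched below, with made-up numbers in a toy table that reuses the column names of df_long. It relies on the fact that `px.bar` stacks rows sharing the same x value by default, so the tallest bar is a per-year, per-type total rather than the largest single row. Rounding up to a multiple of 50 is an arbitrary choice for illustration.

```python
import math
import pandas as pd

# Toy long-format data with the same column names as df_long (made-up numbers)
toy_long = pd.DataFrame({
    "Type of disease": ["Infectious", "Infectious", "Other", "Other"],
    "Year": ["2021", "2021", "2021", "2022"],
    "DeathCount": [62.0, 31.0, 40.0, None],
})

# Bars with the same x value are stacked, so the tallest bar is the
# largest per-year, per-type total, not the largest single row
totals = toy_long.groupby(["Year", "Type of disease"])["DeathCount"].sum()

# Round the largest total up to a multiple of 50 so the axis ends on a tidy number
upper = math.ceil(totals.max() / 50) * 50
axis_range = [0, upper]

print(axis_range)
```

With the real data, `range_y=axis_range` could then stand in for a hard-coded range; a fixed range chosen from the largest single row may cut off the stacked bars.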

In [10]:

fig = px.bar(
    df_long, 
    x="Type of disease", 
    y="DeathCount", 
    color="Type of disease",
    animation_frame="Year", 
    animation_group="Name of the Disease", 
    range_y=[0,200]
)

fig.update_layout(
    width=1000,
    height=700, 
    xaxis=dict(tickfont=dict(size=10)),
    yaxis=dict(tickfont=dict(size=10))
)

fig.show()

Attempt 4.1¶

This is obviously very ugly - so I asked AI to help me with the following prompt

this code worked, i.e. it came up with an interactive plot. however, the formatting is very messy, as the "types of diseases" are very long and the bar graphs themselves are very short. how do i format it? please explain the answer provided.

fig = px.bar(df_long, x="Type of disease", y="DeathCount", color="Type of disease", animation_frame="Year", animation_group="Name of the Disease", range_y=[0,200]) fig.show()

The first suggestion was a horizontal bar graph (I asked them to hide the legend, and provide different tips for the formatting.)

In [11]:

fig = px.bar(
    df_long,
    x="DeathCount",
    y="Type of disease",
    color="Type of disease",
    animation_frame="Year",
    animation_group="Name of the Disease",
    range_x=[0, 200]
)

fig.update_layout(
    yaxis={'categoryorder': 'total ascending'}  # optional: sorts the disease types by total - but how do I also change the y-axis font?
)

fig.show()

Attempt 5¶

Assuming interactive bar graphs are not the best way to represent this data, I looked through plotly's resource again to see what else could be a good representation

This was initially the code I'd used - but after repeated errors, I asked AI for advice on formatting

import numpy as np
df = px.data.gapminder().query("Year == 2022")
fig = px.treemap(df, path=[px.Constant('world'), 'Type of disease', 'Name of the Disease'],
                 values='DeathCount',
                 color='Name of the Disease', hover_data=['DeathCount'])  # apparently hover_data means ?
fig.show()

In [12]:

print(df_long.columns) #due to repeated errors...

Index(['ICD10 CODE', 'Name of the Disease', 'Type of disease', 'Year',
       'DeathCount'],
      dtype='object')

In [13]:

# 1. Use YOUR dataframe (df_long), not the sample one (gapminder) - this part confused me, but I realised I had made the same error above
# 2. Filter for 2022 (since the data doesn't have 2023) - this was another silly mistake on my part
df_long['Year'] = df_long['Year'].astype(int)
df_2022 = df_long.query("Year == 2022")

# 3. Create the Treemap - I much prefer this formatting to the one provided on the plotly website
fig = px.treemap(
    df_2022, 
    path=[px.Constant('Bhutan'), 'Type of disease', 'Name of the Disease'], 
    values='DeathCount',
    color='Type of disease', # Coloring by Category looks less messy
    hover_data=['DeathCount'],
    title="Mortality Cases by Disease Type (2022)"
) 
# 4. Realised that using this sizing actually helped it render correctly - again, not sure why 
fig.update_layout(
    width=1000,
    height=700, 
)

fig.show()
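A note on the `astype(int)` line: after `melt`, the Year values are text, because they started out as column names. The sketch below uses a toy table with made-up numbers and plain boolean filtering (rather than `.query`) to show why comparing against the number 2022 finds nothing until the column is converted.

```python
import pandas as pd

# Toy long-format data; after melt the years are strings, because they
# started life as column names (made-up numbers)
toy_long = pd.DataFrame({
    "Year": ["2021", "2022", "2022"],
    "DeathCount": [31.0, 17.0, 45.0],
})

# Comparing a text column with the number 2022 matches nothing
matches_number = toy_long[toy_long["Year"] == 2022]

# Option 1: compare with the text "2022" instead
matches_text = toy_long[toy_long["Year"] == "2022"]

# Option 2: convert the column to integers first, then compare with a number
converted = toy_long.assign(Year=toy_long["Year"].astype(int))
matches_converted = converted[converted["Year"] == 2022]

print(len(matches_number), len(matches_text), len(matches_converted))
```

Either option selects the same rows; converting once is convenient if the year will also be used as a number elsewhere.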

I was very afraid that I would mess this up, since it seems to be the best representation so far, so I thought I'd do it again, but with some formatting tweaks.

In [14]:

df_long['Year'] = df_long['Year'].astype(int)
df_2022 = df_long.query("Year == 2022")

fig = px.treemap(
    df_2022, 
    path=[px.Constant('Bhutan'), 'Type of disease', 'Name of the Disease'], 
    values='DeathCount',
    color='Type of disease', 
    hover_data=['DeathCount'],
    title="Mortality Cases by Disease Type (2022)"
) 

fig.update_layout(font=dict(size=18))
fig.update_layout(uniformtext=dict(minsize=8, mode='hide'))
fig.update_layout(
    width=1000,
    height=700, 
)

fig.show()

Definitely not much better, since you can't see the text... but maybe less messy?

Hypothetical attempt 6:¶

While scrolling on plotly's resource, I thought it would be super cool to map this data onto a Bhutan map - but maybe I will do this when I have more time!