Analysing the Annual Health Statistics of Bhutan¶
To start with, I had a lot of difficulty even getting my data into this notebook, but I realised this was partly due to how the original data was organised in the csv file. To address this, I've cut down a lot on the data I was planning to analyse.
My analysis now focuses on the table "Mortality cases by diseases for last 5 years, Bhutan", extracted from the same dataset linked in Week 1.
However, I still faced lots of difficulties when it came to the data analysis process (which is still ongoing) since the export from xlsx to csv doesn't seem to have been very smooth...
Part of the process has been documented here (previous draft can be found here) but final analysis is only at the very end. Repeating the process all over again was great for learning, but took a very long time.
The data representation I was ultimately semi-satisfied with is a treemap.
Obtaining the data¶
The following are the ChatGPT prompts used in this first part:
- I need help in uploading and accessing a .csv file on Jupyter Notebooks. Please advise on how to do so.
- What is pandas?
- Where can I find the path to my data set?
import pandas as pd
df = pd.read_csv("datasets/Mortality cases3.csv")
df.head(5)
| | ICD10 CODE | Name of the Disease | Type of disease | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|
| 0 | A02ᴳ | Diarrhoea | Infectious | 6.0 | 6.0 | 2.0 | NaN | 2.0 |
| 1 | A03ᴳ | Dysentery | Infectious | NaN | NaN | NaN | NaN | NaN |
| 2 | A15ᴳ | Tuberculosis | Infectious | 22.0 | 20.0 | 20.0 | 31.0 | 17.0 |
| 3 | A41ᴳ | Other Sepsis, including Septicaemia | Infectious | 62.0 | 46.0 | 52.0 | 32.0 | 45.0 |
| 4 | A50 | Congenital Syphilis | Infectious | NaN | NaN | NaN | NaN | NaN |
df.columns
Index(['ICD10 CODE', 'Name of the Disease', 'Type of disease', '2018', '2019',
'2020', '2021', '2022'],
dtype='object')
years = ['2018', '2019', '2020', '2021', '2022']
df[years] = df[years].apply(pd.to_numeric, errors='coerce') # the "pd.to_numeric" part ensures that data is read as numbers
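To see for myself what `errors='coerce'` does, a minimal sketch on a made-up column may help. The values below are invented for illustration and are not taken from the real csv; the idea is that anything pandas cannot parse as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Toy column (NOT from the real dataset): two clean values and two
# placeholders of the kind a messy xlsx-to-csv export might leave behind.
raw = pd.Series(["6", "22", "n/a", "-"])

# errors="coerce" converts whatever can be parsed and replaces the rest with NaN.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print("cells that became NaN:", cleaned.isna().sum())
```

This probably explains why the table above contains so many NaN values: they would be cells that were empty or not numeric in the csv.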
As an afterthought, once I had started my data analysis, I felt it was important to list the variables:
- ICD10 CODE
- Name of the Disease
- Type of disease
- 2018
- 2019
- 2020
- 2021
- 2022
Data Analysis¶
Now (for the second time) I'm thinking my data is finally ready to be analysed, so I asked ChatGPT for help with creating an interactive visualisation on Plotly that would allow someone to select a disease and see information on it accordingly, with the context that I'd like to use it with students.
Some of the code below was taken from ChatGPT/Gemini (as referenced). I also used https://plotly.com/python/plotly-express/ as a reference.
import plotly.express as px
import plotly.io as pio # to enable plotly rendering
pio.renderers.default = "notebook" # or notebook_connected - this is from chatGPT
I don't really understand the part below, but I wasn't able to create any graphics without it: both AI platforms noted that I had to "make the data 'long format'" if I wanted interactive visuals, so ChatGPT provided the following code:
(this is also the part I really struggled with in my abandoned week02 notebook)
df_long = df.melt(
id_vars=['ICD10 CODE', 'Name of the Disease', 'Type of disease'],
value_vars=years,
var_name='Year',
value_name='DeathCount'
)
df_long.head()
| | ICD10 CODE | Name of the Disease | Type of disease | Year | DeathCount |
|---|---|---|---|---|---|
| 0 | A02ᴳ | Diarrhoea | Infectious | 2018 | 6.0 |
| 1 | A03ᴳ | Dysentery | Infectious | 2018 | NaN |
| 2 | A15ᴳ | Tuberculosis | Infectious | 2018 | 22.0 |
| 3 | A41ᴳ | Other Sepsis, including Septicaemia | Infectious | 2018 | 62.0 |
| 4 | A50 | Congenital Syphilis | Infectious | 2018 | NaN |
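To get a feel for what "long format" means, here is a minimal sketch of `melt` on a tiny made-up table (the disease names and numbers are placeholders, not real data). As far as I can tell, each year column gets stacked into rows, so one wide row per disease becomes one row per disease per year.

```python
import pandas as pd

# Toy "wide" table: one row per disease, one column per year (made-up numbers).
wide = pd.DataFrame({
    "Name of the Disease": ["Disease A", "Disease B"],
    "2021": [10, 3],
    "2022": [12, 5],
})

# melt() keeps the id_vars columns as they are and stacks the value_vars columns
# into two new columns: the old column name ("Year") and the cell value ("DeathCount").
long = wide.melt(
    id_vars=["Name of the Disease"],
    value_vars=["2021", "2022"],
    var_name="Year",
    value_name="DeathCount",
)

print(long)  # 2 diseases x 2 years = 4 rows
```

Plotly Express seems to want this shape because arguments such as `x="Year"` and `animation_frame="Year"` each need a single column to point at.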
Assuming I have successfully "melted the DataFrame", I experimented with code made available on the plotly website, with additional help from ChatGPT. For example, one of my prompts was:
I want to use the following code for my analysis. df = px.data.iris() fig = px.scatter(df, x="Year", y="DeathCount", color="????") fig.show() What is the "color" part?
This helped me gain a better sense of the different visualisation options.
Attempt 1:¶
df = df_long
fig = px.scatter(df_long, x="Year", y="DeathCount", color="Name of the Disease")
fig.show()
Sort of interesting, but doesn't seem to tell me much ...
Attempt 2:¶
fig = px.line(
df_long,
x="Year",
y="DeathCount",
color="Name of the Disease",
markers=True
)
fig.show()
Same thing, just connected by a line?
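One possible reason the scatter and line charts are hard to read is that every disease gets its own colour. A sketch of one way around this, assuming it is acceptable to look only at the diseases with the most deaths overall: total the deaths per disease, keep the top few, and plot just those. The frame `toy_long` and the cut-off `top_n` are hypothetical stand-ins; in the notebook the same steps would be applied to `df_long`.

```python
import pandas as pd

# Tiny stand-in for df_long with the same column names (made-up numbers).
toy_long = pd.DataFrame({
    "Name of the Disease": ["A", "A", "B", "B", "C", "C"],
    "Year": ["2021", "2022"] * 3,
    "DeathCount": [50.0, 60.0, 2.0, None, 30.0, 25.0],
})

top_n = 2  # hypothetical cut-off; adjust to taste

# Total deaths per disease over all years (NaN values are skipped by sum()).
totals = toy_long.groupby("Name of the Disease")["DeathCount"].sum()

# Names of the top_n diseases, then only the rows belonging to them.
top_names = totals.nlargest(top_n).index
top_only = toy_long[toy_long["Name of the Disease"].isin(top_names)]

print(sorted(top_names))
print(top_only)
```

The filtered frame could then be passed to `px.line` in place of `df_long`, which should give a handful of lines instead of dozens.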
Attempt 3:¶
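I got the following example code from AI to understand how to construct a Sankey diagram with plotly in Python. Note that it refers to columns from a different dataset, so it will not run on my data as written:
import plotly.graph_objects as go  # needed for go.Figure and go.Sankey
flow_df = df.groupby(['parental_education_level', 'family_income_range', 'dropout_risk']).size().reset_index(name='count')
fig = go.Figure(data=[go.Sankey(
    # ... node and link definitions ...
)])
fig.show()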
But for now, upon further reflection, a Sankey also doesn't feel like the best type of representation for the data I have. It could perhaps show the changes from year to year, but I felt like it would create something quite messy...
Attempt 4:¶
Trying out animations
To figure out the ranges and size, I just looked at my data, but this felt like a rather inefficient way to do it. Maybe better to find how to code an automatic range finder?
df_long['DeathCount'].max() # according to ChatGPT, this finds the largest single value in the column (across all years and diseases, not for one given year)
166.0
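A possible "automatic range finder", sketched on made-up numbers. One thing to watch, as far as I understand it: `px.bar` stacks rows that share the same x value by default, so the tallest bar is the sum of all diseases within a type for a year, which can be larger than the biggest single value. The frame `toy_long` and the 10% headroom are assumptions for illustration.

```python
import math
import pandas as pd

# Tiny stand-in for df_long with the same column names (made-up numbers).
toy_long = pd.DataFrame({
    "Type of disease": ["Infectious", "Infectious", "Other",
                        "Infectious", "Infectious", "Other"],
    "Year": ["2021", "2021", "2021", "2022", "2022", "2022"],
    "DeathCount": [120.0, 95.0, 40.0, 166.0, None, 10.0],
})

# Largest single cell: what .max() on the column reports.
largest_cell = toy_long["DeathCount"].max()

# Height of each stacked bar: sum per year and disease type (NaN is skipped).
stacked = toy_long.groupby(["Year", "Type of disease"])["DeathCount"].sum()
tallest_bar = stacked.max()

# Add about 10% headroom and round up to the next multiple of 10.
upper = math.ceil(tallest_bar * 1.1 / 10) * 10

print(largest_cell, tallest_bar, upper)
```

With the real data, `range_y=[0, upper]` could replace the hand-picked `range_y=[0,200]`, so the axis stays fixed across animation frames without cutting off the tallest bar.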
fig = px.bar(
df_long,
x="Type of disease",
y="DeathCount",
color="Type of disease",
animation_frame="Year",
animation_group="Name of the Disease",
range_y=[0,200]
)
fig.update_layout(
width=1000,
height=700,
xaxis=dict(tickfont=dict(size=10)),
yaxis=dict(tickfont=dict(size=10))
)
fig.show()
Attempt 4.1¶
This is obviously very ugly - so I asked AI to help me with the following prompt
this code worked, i.e. it came up with an interactive plot. however, the formatting is very messy, as the "types of diseases" are very long and the bar graphs themselves are very short. how do i format it? please explain the answer provided.
fig = px.bar(df_long, x="Type of disease", y="DeathCount", color="Type of disease", animation_frame="Year", animation_group="Name of the Disease", range_y=[0,200]) fig.show()
The first suggestion was a horizontal bar graph (I asked them to hide the legend, and provide different tips for the formatting.)
fig = px.bar(
df_long,
x="DeathCount",
y="Type of disease",
color="Type of disease",
animation_frame="Year",
animation_group="Name of the Disease",
range_x=[0, 200]
)
fig.update_layout(
yaxis={'categoryorder': 'total ascending'}, #optional: sorts the disease types by their totals - but how do I also change the y-axis font?
)
fig.show()
Attempt 5¶
Assuming interactive bar graphs are not the best way to represent this data, I looked through plotly's resource again to see what else could be a good representation
This was initially the code I'd used - but after repeated errors, I asked AI for advice on formatting
import numpy as np
df = px.data.gapminder().query("Year == 2022")
fig = px.treemap(df, path=[px.Constant('world'), 'Type of disease', 'Name of the Disease'],
                 values='DeathCount',
                 color='Name of the Disease', hover_data=['DeathCount'])  # apparently hover_data means ?
fig.show()
print(df_long.columns) #due to repeated errors...
Index(['ICD10 CODE', 'Name of the Disease', 'Type of disease', 'Year',
'DeathCount'],
dtype='object')
# 1. Use YOUR dataframe (df_long), not the sample one (gapminder) - this part confused me, but I realised I had made the same error above
# 2. Filter for 2022 (since the data doesn't have 2023) - this was another silly mistake on my part
df_long['Year'] = df_long['Year'].astype(int)
df_2022 = df_long.query("Year == 2022")
# 3. Create the Treemap - I much prefer this formatting to the one provided on the plotly website
fig = px.treemap(
df_2022,
path=[px.Constant('Bhutan'), 'Type of disease', 'Name of the Disease'],
values='DeathCount',
color='Type of disease', # Coloring by Category looks less messy
hover_data=['DeathCount'],
title="Mortality Cases by Disease Type (2022)"
)
# 4. Realised that using this sizing actually helped it render correctly - again, not sure why
fig.update_layout(
width=1000,
height=700,
)
fig.show()
I was very afraid that I would mess this up, since it seems to be the best representation so far, so I thought I'd do it again, but with some formatting tweaks.
df_long['Year'] = df_long['Year'].astype(int)
df_2022 = df_long.query("Year == 2022")
fig = px.treemap(
df_2022,
path=[px.Constant('Bhutan'), 'Type of disease', 'Name of the Disease'],
values='DeathCount',
color='Type of disease',
hover_data=['DeathCount'],
title="Mortality Cases by Disease Type (2022)"
)
fig.update_layout(font=dict(size=18))
fig.update_layout(uniformtext=dict(minsize=8, mode='hide'))
fig.update_layout(
width=1000,
height=700,
)
fig.show()
Definitely not much better, since you can't see the text... but maybe less messy?
Hypothetical attempt 6:¶
While scrolling on plotly's resource, I thought it would be super cool to map this data onto a Bhutan map - but maybe I will do this when I have more time!