Desel P. Dorji - Fab Futures - Data Science
Home About

A previous draft that I'm keeping for notes ... but it is full of errors and references to different data sets so I moved to a completely new notebook¶

Obtaining the data¶

The following are the ChatGPT prompts used in this first part:

  • I need help in uploading and accessing a .csv file on Jupyter Notebooks. Please advise on how to do so.
  • What is pandas?
  • Where can I find the path to my data set?
In [32]:
import pandas as pd

df = pd.read_csv("datasets/Mortality cases.csv")
df.head(5)
Out[32]:
Table 2.8: Mortality cases by diseases for last 5 years, Bhutan Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 ICD10 CODE Name of the Disease Year NaN NaN NaN NaN
1 NaN NaN 2018 2019.0 2020.0 2021.0 2022.0
2 Infectious Diseases NaN NaN NaN NaN NaN NaN
3 A02á´³ Diarrhoea 6 6.0 2.0 NaN 2.0
4 A03á´³ Dysentery NaN NaN NaN NaN NaN

Clearing up the data¶

From the previous table created, I realised that my csv file was looking quite difficult to decipher. I asked ChatGPT to help me with "cleaning my csv file". The following code is from ChatGPT based on how it recommended I clean the file.

In [16]:
df.columns
Out[16]:
Index(['Table 2.8: Mortality cases by diseases for last 5 years, Bhutan',
       'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5',
       'Unnamed: 6'],
      dtype='object')
In [33]:
file_path = "datasets/Mortality cases.csv"

df_raw = pd.read_csv(file_path, header=None)   # load without assuming a header
df_raw.head(5)
Out[33]:
0 1 2 3 4 5 6
0 Table 2.8: Mortality cases by diseases for las... NaN NaN NaN NaN NaN NaN
1 ICD10 CODE Name of the Disease Year NaN NaN NaN NaN
2 NaN NaN 2018 2019.0 2020.0 2021.0 2022.0
3 Infectious Diseases NaN NaN NaN NaN NaN NaN
4 A02á´³ Diarrhoea 6 6.0 2.0 NaN 2.0
In [23]:
header_row = 2 # Choosing this header as it contains the years ... not sure if the name of the disease and the ICD10 Code will be accounted for? 

df = pd.read_csv(file_path, header=header_row)

df.head()
Out[23]:
Unnamed: 0 Unnamed: 1 2018 2019 2020 2021 2022
0 Infectious Diseases NaN NaN NaN NaN NaN NaN
1 A02á´³ Diarrhoea 6.0 6.0 2.0 NaN 2.0
2 A03á´³ Dysentery NaN NaN NaN NaN NaN
3 A15á´³ Tuberculosis 22.0 20.0 20.0 31.0 17.0
4 A41á´³ Other Sepsis, including Septicaemia 62.0 46.0 52.0 32.0 45.0

At this point in the process, it felt quite difficult imagining how I'd work with the data when the csv file itself was so disorganised. Rather than asking ChatGPT more complicated questions, I decided to upload a whole new file and organise the data myself in a way that seems more apropos of csv.

In [37]:
df = pd.read_csv("datasets/Mortality cases3.csv")
df.head(5)
Out[37]:
ICD10 CODE Name of the Disease Type of disease 2018 2019 2020 2021 2022
0 A02á´³ Diarrhoea Infectious 6.0 6.0 2.0 NaN 2.0
1 A03á´³ Dysentery Infectious NaN NaN NaN NaN NaN
2 A15á´³ Tuberculosis Infectious 22.0 20.0 20.0 31.0 17.0
3 A41á´³ Other Sepsis, including Septicaemia Infectious 62.0 46.0 52.0 32.0 45.0
4 A50 Congenital Syphilis Infectious NaN NaN NaN NaN NaN
In [40]:
file_path = "datasets/Mortality cases3.csv"

df_raw = pd.read_csv(file_path)  
df_raw.head(10)
Out[40]:
ICD10 CODE Name of the Disease Type of disease 2018 2019 2020 2021 2022
0 A02á´³ Diarrhoea Infectious 6.0 6.0 2.0 NaN 2.0
1 A03á´³ Dysentery Infectious NaN NaN NaN NaN NaN
2 A15á´³ Tuberculosis Infectious 22.0 20.0 20.0 31.0 17.0
3 A41á´³ Other Sepsis, including Septicaemia Infectious 62.0 46.0 52.0 32.0 45.0
4 A50 Congenital Syphilis Infectious NaN NaN NaN NaN NaN
5 A51á´³ Genital Ulcer Infectious NaN NaN NaN NaN NaN
6 A53á´³ Syphilis Infectious NaN NaN NaN NaN NaN
7 A54á´³ Urethral/Vaginal Discharges Infectious NaN NaN NaN NaN NaN
8 A63á´³ Other STD excluding HIV/AIDS Infectious NaN 1.0 NaN 1.0 1.0
9 A77á´³ Rickettsial Disease Infectious 2.0 2.0 2.0 1.0 2.0
In [41]:
df.columns
Out[41]:
Index(['ICD10 CODE', 'Name of the Disease', 'Type of disease', '2018', '2019',
       '2020', '2021', '2022'],
      dtype='object')
In [65]:
years = ['2018', '2019', '2020', '2021', '2022']
df[years] = df[years].apply(pd.to_numeric, errors='coerce') # pd.to_numeric ensures that data is read as numbers
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[65], line 2
      1 years = ['2018', '2019', '2020', '2021', '2022']
----> 2 df[years] = df[years].apply(pd.to_numeric, errors='coerce') # pd.to_numeric ensures that data is read as numbers

File /opt/conda/lib/python3.13/site-packages/pandas/core/frame.py:4119, in DataFrame.__getitem__(self, key)
   4117     if is_iterator(key):
   4118         key = list(key)
-> 4119     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4121 # take() does not accept boolean indexers
   4122 if getattr(indexer, "dtype", None) == bool:

File /opt/conda/lib/python3.13/site-packages/pandas/core/indexes/base.py:6212, in Index._get_indexer_strict(self, key, axis_name)
   6209 else:
   6210     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6212 self._raise_if_missing(keyarr, indexer, axis_name)
   6214 keyarr = self.take(indexer)
   6215 if isinstance(key, Index):
   6216     # GH 42790 - Preserve name from an Index

File /opt/conda/lib/python3.13/site-packages/pandas/core/indexes/base.py:6261, in Index._raise_if_missing(self, key, indexer, axis_name)
   6259 if nmissing:
   6260     if nmissing == len(indexer):
-> 6261         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6263     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6264     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['2018', '2019', '2020', '2021', '2022'], dtype='object')] are in the [columns]"
In [66]:
df.columns
Out[66]:
Index(['ICD10 CODE', 'Name of the Disease', 'Year', 'Deaths'], dtype='object')

Data Analysis¶

Now I'm thinking my data is finally ready to be analysed, so I asked ChatGPT for help with creating an interactive visualisation on Plotly that would allow someone to select a disease and see information on it accordingly.

Some the code below was taken from ChatGPT/Gemini (as referenced) — I also used https://plotly.com/python/plotly-express/ as a reference

In [44]:
import plotly.express as px

I don't really understand the part below, but I wasn't able to create any graphics - both AI platforms noted that I had to "Make the data 'long format'" if I wanted interactive visuals, so ChatGPT provided the following code:

In [63]:
df_long = df.melt(
    id_vars=['ICD10 CODE', 'Name of the Disease', 'Type of disease'],
    value_vars=years,
    var_name='Year',
    value_name='DeathCount'
)

df_long.head()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[63], line 1
----> 1 df_long = df.melt(
      2     id_vars=['ICD10 CODE', 'Name of the Disease', 'Type of disease'],
      3     value_vars=years,
      4     var_name='Year',
      5     value_name='DeathCount'
      6 )
      8 df_long.head()

File /opt/conda/lib/python3.13/site-packages/pandas/core/frame.py:9969, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
   9959 @Appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"})
   9960 def melt(
   9961     self,
   (...)   9967     ignore_index: bool = True,
   9968 ) -> DataFrame:
-> 9969     return melt(
   9970         self,
   9971         id_vars=id_vars,
   9972         value_vars=value_vars,
   9973         var_name=var_name,
   9974         value_name=value_name,
   9975         col_level=col_level,
   9976         ignore_index=ignore_index,
   9977     ).__finalize__(self, method="melt")

File /opt/conda/lib/python3.13/site-packages/pandas/core/reshape/melt.py:74, in melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
     70 if missing.any():
     71     missing_labels = [
     72         lab for lab, not_found in zip(labels, missing) if not_found
     73     ]
---> 74     raise KeyError(
     75         "The following id_vars or value_vars are not present in "
     76         f"the DataFrame: {missing_labels}"
     77     )
     78 if value_vars_was_not_none:
     79     frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['Type of disease', '2018', '2019', '2020', '2021', '2022']"
In [60]:
print(df.columns.tolist())
['ICD10 CODE', 'Name of the Disease', 'Year', 'Deaths']
In [62]:
df.columns = df.columns.str.strip()

Then, I experimented with code as made available on the plotly website, with additional help from ChatGPT: for example, my prompt was

I want to use the following code for my analysis. df = px.data.iris() fig = px.scatter(df, x="Year", y="Deaths", color="????") fig.show() What is the "color" part?

This helped me gain a better sense of the different visualisation options

In [52]:
df = df_long
fig = px.scatter(df_long, x="Year", y="Deaths", color="Name of the Disease")
fig.show()

I'm still not sure why the code above doesn't work ...

In [53]:
df_long.head()
Out[53]:
ICD10 CODE Name of the Disease Year Deaths
0 A02á´³ Diarrhoea 2018 6.0
1 A03á´³ Dysentery 2018 NaN
2 A15á´³ Tuberculosis 2018 22.0
3 A41á´³ Other Sepsis, including Septicaemia 2018 62.0
4 A50 Congenital Syphilis 2018 NaN
In [56]:
fig = px.scatter(
    df_long,
    x="Year",
    y="Deaths",
    color="Name of the Disease",
    hover_name="Name of the Disease",
    title="Mortality Cases by Disease (Bhutan, 2018–2022)"
)

fig.show()
In [ ]: