Week 6: Assignment ~ Investigating the Probability Distribution of My Data
In this assignment I investigate how reported diarrhoea cases among females are distributed across ten age groups.
The counts come from DataSet_CommonDiseases.csv. I extract the female case count for each age group, then plot the values as a histogram and as a bar graph.
Step 1: Loading and Cleaning the Data
In [5]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Load data: header=1 skips the first row and uses the second row as column names
df = pd.read_csv("datasets/DataSet_CommonDiseases.csv", header=1)
df = df.dropna(how='all') # remove fully empty rows
df = df.fillna(0) # replace blanks with 0
# Find the row for "Diarrhoea"
diarrhoea_row = df[df.iloc[:, 0] == 'Diarrhoea']
# Female columns are: F, F.1, F.2, ..., F.9 (10 age groups)
female_counts = diarrhoea_row.iloc[:, [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]].values.flatten()
female_counts = female_counts.astype(int) # make sure they're numbers
# Age group labels
age_groups = [
'0–29 Days', '1–11 Months', '1–4 Years', '5–9 Years',
'10–14 Years', '15–19 Years', '20–24 Years',
'25–49 Years', '50–59 Years', '60+ Years'
]
# Create a tidy DataFrame
data = pd.DataFrame({
'Age Group': age_groups,
'Female Cases': female_counts
})
data
Out[5]:
| | Age Group | Female Cases |
|---|---|---|
| 0 | 0–29 Days | 66 |
| 1 | 1–11 Months | 906 |
| 2 | 1–4 Years | 2800 |
| 3 | 5–9 Years | 1753 |
| 4 | 10–14 Years | 1378 |
| 5 | 15–19 Years | 997 |
| 6 | 20–24 Years | 920 |
| 7 | 25–49 Years | 2928 |
| 8 | 50–59 Years | 865 |
| 9 | 60+ Years | 1387 |
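Since the assignment is about the probability distribution of the data, it may help to turn the raw counts into proportions, which gives an empirical distribution over the age groups. The following is a minimal sketch, not part of the original notebook: the counts are copied by hand from the table above so that it runs without the CSV, and the names `total_cases`, `proportions`, and `modal_group` are my own illustrative choices.

```python
# Counts copied from the Out[5] table above, so this runs without the CSV
age_groups = [
    '0–29 Days', '1–11 Months', '1–4 Years', '5–9 Years',
    '10–14 Years', '15–19 Years', '20–24 Years',
    '25–49 Years', '50–59 Years', '60+ Years'
]
female_counts = [66, 906, 2800, 1753, 1378, 997, 920, 2928, 865, 1387]

total_cases = sum(female_counts)

# Share of all female cases that falls in each age group
proportions = {
    group: count / total_cases
    for group, count in zip(age_groups, female_counts)
}

# Age group holding the largest share of cases
modal_group = max(proportions, key=proportions.get)

for group, share in proportions.items():
    print(f"{group:<12} {share:.3f}")
print("Largest share:", modal_group)
```

With the `data` DataFrame from Step 1, the same shares could be computed as `data['Female Cases'] / data['Female Cases'].sum()`. Note that the age groups have very different widths, so a large share does not by itself mean a higher risk per year of age.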
Step 2: Plotting a Histogram
In [6]:
plt.figure(figsize=(8, 4))
plt.hist(data['Female Cases'], bins=8, color='lightcoral', edgecolor='black', alpha=0.8)
plt.title("Histogram of Female Diarrhoea Cases (All Age Groups)")
plt.xlabel("Case Count")
plt.ylabel("Number of Age Groups")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
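A histogram of only ten values is sensitive to the choice of `bins`, so a few summary statistics can complement the plot. This is a rough sketch using Python's standard `statistics` module, with the counts again copied from the Step 1 table. The variable names are illustrative, and comparing the mean with the median is only an informal hint about skew, not a formal test.

```python
import statistics

# Female case counts per age group, copied from the Step 1 table
female_counts = [66, 906, 2800, 1753, 1378, 997, 920, 2928, 865, 1387]

mean_cases = statistics.mean(female_counts)
median_cases = statistics.median(female_counts)
stdev_cases = statistics.stdev(female_counts)  # sample standard deviation

print(f"Mean:   {mean_cases}")
print(f"Median: {median_cases}")
print(f"Stdev:  {stdev_cases:.1f}")

# A mean above the median is an informal hint of right skew
right_skew_hint = mean_cases > median_cases
print("Mean above median:", right_skew_hint)
```

For these ten counts the mean is 1400 and the median is 1187.5, which agrees with a histogram where most age groups sit at lower counts and two groups sit far to the right. In the notebook itself, `data['Female Cases'].describe()` would give a similar summary.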
Step 3: Plotting a Bar Graph
In [7]:
plt.figure(figsize=(10, 4))
plt.bar(data['Age Group'], data['Female Cases'], color='steelblue')
plt.title("Female Diarrhoea Cases by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Number of Cases")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()