PROBABILITY IN DATA SCIENCE
Understanding how data behaves is one of the first steps in data science. Before building models or running an analysis, we need to know how the values in a dataset are spread out, and that is where probability distributions come in.
High-risk HOURS [Discrete probability distribution]
1. Extract crash hour from the CSV
In [13]:
import pandas as pd

df = pd.read_csv("datasets/MotorVehicle_CrashRecord.csv")

# Convert CRASH TIME to hour; unparseable values become NaN
df["CRASH HOUR"] = pd.to_datetime(
    df["CRASH TIME"],
    format="%H:%M",
    errors="coerce"
).dt.hour

# Remove rows whose hour could not be parsed
df = df.dropna(subset=["CRASH HOUR"])
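To see what the conversion above does with bad input, here is a minimal sketch on a few made-up time strings. The sample values are hypothetical and are not taken from the crash dataset; only the `pd.to_datetime` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical sample values standing in for the CRASH TIME column
sample = pd.Series(["08:15", "17:05", "00:30", "25:99", None])

# errors="coerce" turns anything that does not match %H:%M into NaT,
# so .dt.hour yields NaN for those rows instead of raising an error
hours = pd.to_datetime(sample, format="%H:%M", errors="coerce").dt.hour

# Drop the unparseable rows and cast the remaining hours to integers
valid_hours = hours.dropna().astype(int)

print(hours.isna().sum())    # number of rows that failed to parse
print(valid_hours.tolist())  # hours that survived the cleaning step
```

Counting the NaN values before dropping them is a cheap way to check how many records the cleaning step discards.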
2. Probability distribution of crashes by hour
In [15]:
hour_prob = df["CRASH HOUR"].value_counts(normalize=True).sort_index()
hour_prob
Out[15]:
CRASH HOUR
0     0.050
1     0.040
2     0.020
3     0.015
4     0.030
5     0.020
6     0.035
7     0.020
8     0.055
9     0.060
10    0.030
11    0.040
12    0.050
13    0.040
14    0.065
15    0.030
16    0.070
17    0.090
18    0.025
19    0.055
20    0.045
21    0.055
22    0.035
23    0.025
Name: proportion, dtype: float64
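Each value in this output is an empirical probability: the number of crashes recorded in hour h divided by the total number of crashes. The sketch below checks that `value_counts(normalize=True)` matches that manual calculation. The hour values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical crash hours, not taken from the dataset
hours = pd.Series([8, 8, 17, 17, 17, 23, 0, 8])

# Manual estimate: count per hour divided by the total count
counts = hours.value_counts().sort_index()
pmf_manual = counts / counts.sum()

# Built-in equivalent used in the cell above
pmf_builtin = hours.value_counts(normalize=True).sort_index()

# Largest difference between the two estimates, and the total probability
print((pmf_manual - pmf_builtin).abs().max())
print(pmf_builtin.sum())
```

Because the probabilities are shares of one total, they always sum to 1, which makes the sum a quick sanity check on the output.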
3. Identify high-risk hours
In [16]:
high_risk_hours = hour_prob[hour_prob > hour_prob.mean()]
high_risk_hours
Out[16]:
CRASH HOUR
0     0.050
8     0.055
9     0.060
12    0.050
14    0.065
16    0.070
17    0.090
19    0.055
20    0.045
21    0.055
Name: proportion, dtype: float64
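It helps to know what the threshold `hour_prob.mean()` actually is. Because the probabilities sum to 1, their mean is 1 divided by the number of hours present, so "above the mean" is the same as "above an equal share per hour". The sketch below rebuilds the distribution from the values printed in Out[15] and confirms this; typing the series in by hand is only a convenience for illustration.

```python
import pandas as pd

# Probabilities copied from Out[15], indexed by hour 0..23
probs = [
    0.050, 0.040, 0.020, 0.015, 0.030, 0.020, 0.035, 0.020,
    0.055, 0.060, 0.030, 0.040, 0.050, 0.040, 0.065, 0.030,
    0.070, 0.090, 0.025, 0.055, 0.045, 0.055, 0.035, 0.025,
]
hour_prob = pd.Series(probs, index=range(24), name="proportion")

# The mean of a normalized distribution equals 1 / (number of categories)
threshold = hour_prob.mean()
uniform_share = 1 / len(hour_prob)

# Same filter as the cell above
high_risk = hour_prob[hour_prob > threshold]

print(round(threshold, 4), round(uniform_share, 4))
print(high_risk.index.tolist())
```

One caveat: `value_counts` omits hours with no crashes at all, so on sparse data the mean becomes 1 over the number of observed hours rather than 1/24. Calling `hour_prob.reindex(range(24), fill_value=0.0)` before taking the mean would keep the baseline at 1/24.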
High-risk BOROUGHS [Categorical Probability Distribution]
1. Clean borough data
In [17]:
df = df.dropna(subset=["BOROUGH"])
2. Borough-wise probability distribution
In [18]:
borough_prob = df["BOROUGH"].value_counts(normalize=True)
borough_prob
Out[18]:
BOROUGH
BROOKLYN         0.443548
QUEENS           0.225806
MANHATTAN        0.153226
BRONX            0.137097
STATEN ISLAND    0.040323
Name: proportion, dtype: float64
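Since rows with a missing BOROUGH were dropped first, these shares describe only crashes where a borough was recorded. The sketch below shows one way to check how much data that step removes, using the `dropna` parameter of `value_counts`. The borough values are made up for illustration.

```python
import pandas as pd

# Hypothetical BOROUGH column with some missing entries
boroughs = pd.Series(["BROOKLYN", "QUEENS", None, "BROOKLYN", None, "BRONX"])

# Share of each borough among all records, with missing values as a category
with_missing = boroughs.value_counts(normalize=True, dropna=False)

# Share of each borough among records that have a borough (as in the cell above)
recorded_only = boroughs.dropna().value_counts(normalize=True)

# Fraction of records with no borough at all
missing_share = boroughs.isna().mean()

print(with_missing)
print(recorded_only)
print(missing_share)
```

If the missing share is large, the recorded-only distribution may not reflect where crashes actually happen, so it is worth reporting both numbers.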
3. Identify high-risk boroughs
In [19]:
high_risk_boroughs = borough_prob[borough_prob > borough_prob.mean()]
high_risk_boroughs
Out[19]:
BOROUGH
BROOKLYN    0.443548
QUEENS      0.225806
Name: proportion, dtype: float64