Adip Rai - Fab Futures - Data Science

PROBABILITY IN DATA SCIENCE

Understanding how data behaves is one of the first steps in data science. Before building models or running analyses, we need to know how the values in our dataset are spread out, and that is where probability distributions come in.

High-risk HOURS [Discrete probability distribution]

1. Extract crash hour from the CSV

In [13]:
import pandas as pd

df = pd.read_csv("datasets/MotorVehicle_CrashRecord.csv")

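# Parse CRASH TIME as HH:MM and keep only the hour (0-23).
# errors="coerce" turns unparseable times into NaT, so their hour is NaN.
df["CRASH HOUR"] = pd.to_datetime(
    df["CRASH TIME"],
    format="%H:%M",
    errors="coerce"
).dt.hour

# Drop rows whose crash time could not be parsed
df = df.dropna(subset=["CRASH HOUR"])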
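
Because `errors="coerce"` silently turns bad values into missing ones, it can help to count how many rows are lost before dropping them. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration and are not taken from the crash CSV):

```python
import pandas as pd

# Hypothetical stand-in for the "CRASH TIME" column (not real crash data)
toy = pd.DataFrame({"CRASH TIME": ["08:30", "17:05", "bad", None, "00:15"]})

# Same parsing call as in the cell above
parsed = pd.to_datetime(toy["CRASH TIME"], format="%H:%M", errors="coerce")

# Values that could not be parsed become NaT, so we can count them
n_invalid = int(parsed.isna().sum())

# Hours are floats while NaN is present; after dropping NaN they can be integers
hours = parsed.dt.hour.dropna().astype(int)

print(f"{n_invalid} of {len(toy)} rows have an unusable crash time")
print(hours.tolist())
```

If the count of invalid rows is large relative to the dataset, the hourly distribution computed afterwards describes only the rows that survived the cleaning step.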

2. Probability distribution of crashes by hour

In [15]:
hour_prob = df["CRASH HOUR"].value_counts(normalize=True).sort_index()
hour_prob
Out[15]:
CRASH HOUR
0     0.050
1     0.040
2     0.020
3     0.015
4     0.030
5     0.020
6     0.035
7     0.020
8     0.055
9     0.060
10    0.030
11    0.040
12    0.050
13    0.040
14    0.065
15    0.030
16    0.070
17    0.090
18    0.025
19    0.055
20    0.045
21    0.055
22    0.035
23    0.025
Name: proportion, dtype: float64
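
To see what `value_counts(normalize=True)` is doing, the sketch below reproduces it by hand on a small hypothetical series of hours (the values are invented for illustration). It also shows one possible way to list hours with no crashes as probability 0, since `value_counts` only reports values that actually occur:

```python
import pandas as pd

# Hypothetical crash hours, for illustration only
hours = pd.Series([8, 8, 17, 17, 17, 0, 23, 8], name="CRASH HOUR")

# Empirical probability = count of each hour / total number of crashes
counts = hours.value_counts().sort_index()
pmf_manual = counts / counts.sum()

# Same result in one step
pmf = hours.value_counts(normalize=True).sort_index()

# Hours that never occur are absent from pmf; reindex to show them as 0
pmf_full = pmf.reindex(range(24), fill_value=0.0)

print(pmf)
print("sum of probabilities:", pmf.sum())
```

A valid discrete probability distribution has non-negative values that sum to 1, which is a quick sanity check worth running on `hour_prob` as well.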

3. Identify high-risk hours

In [16]:
high_risk_hours = hour_prob[hour_prob > hour_prob.mean()]
high_risk_hours
Out[16]:
CRASH HOUR
0     0.050
8     0.055
9     0.060
12    0.050
14    0.065
16    0.070
17    0.090
19    0.055
20    0.045
21    0.055
Name: proportion, dtype: float64
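
The threshold `hour_prob.mean()` has a simple interpretation: the probabilities sum to 1 and all 24 hours appear, so the mean is 1/24 (about 0.0417), the share each hour would have if crashes were spread evenly over the day. The sketch below rebuilds the series from the `hour_prob` output printed above and adds a ratio to that uniform baseline; the name `lift` is our own label for this ratio, not something from the original notebook:

```python
import pandas as pd

# Probabilities copied from the hour_prob output above
hour_prob = pd.Series(
    [0.050, 0.040, 0.020, 0.015, 0.030, 0.020, 0.035, 0.020,
     0.055, 0.060, 0.030, 0.040, 0.050, 0.040, 0.065, 0.030,
     0.070, 0.090, 0.025, 0.055, 0.045, 0.055, 0.035, 0.025],
    index=range(24),
    name="proportion",
)

# Mean of a distribution over 24 categories is 1/24
threshold = hour_prob.mean()

# Hours whose share of crashes is above the uniform baseline
high_risk_hours = hour_prob[hour_prob > threshold]

# How many times more likely than the uniform baseline each hour is
lift = high_risk_hours / threshold

print(round(threshold, 4))
print(lift.round(2))
```

Read this way, "high-risk" here means "more crashes than an even spread would give", which is a relative rule and not a statistical test.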

High-risk BOROUGHS [Categorical Probability Distribution]

1. Clean borough data

In [17]:
df = df.dropna(subset=["BOROUGH"])

2. Borough-wise probability distribution

In [18]:
borough_prob = df["BOROUGH"].value_counts(normalize=True)
borough_prob
Out[18]:
BOROUGH
BROOKLYN         0.443548
QUEENS           0.225806
MANHATTAN        0.153226
BRONX            0.137097
STATEN ISLAND    0.040323
Name: proportion, dtype: float64
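
These proportions are computed only over rows that have a borough recorded. As a rough illustration of how much that matters, the sketch below compares the distribution with and without missing values on hypothetical toy data (the `toy` series is invented and does not reflect the real file):

```python
import pandas as pd

# Hypothetical borough column with missing entries, for illustration only
toy = pd.Series(["BROOKLYN", "QUEENS", None, "BROOKLYN", None], name="BOROUGH")

# Default: missing values are excluded from the denominator
without_missing = toy.value_counts(normalize=True)

# dropna=False keeps missing values as their own category
with_missing = toy.value_counts(normalize=True, dropna=False)

# Share of rows with no borough recorded
missing_share = float(with_missing[with_missing.index.isna()].sum())

print(without_missing)
print(with_missing)
print("missing share:", missing_share)
```

If the missing share is large, the borough probabilities describe only the crashes that were geocoded, which may not be representative of all crashes.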

3. Identify high-risk boroughs

In [19]:
high_risk_boroughs = borough_prob[borough_prob > borough_prob.mean()]
high_risk_boroughs
Out[19]:
BOROUGH
BROOKLYN    0.443548
QUEENS      0.225806
Name: proportion, dtype: float64
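
The hour and borough analyses follow the same pattern, so the rule could be wrapped in a small helper. This is only a sketch: `above_uniform` is a hypothetical name of our own, and it uses `1 / len(prob)` as the threshold, which matches `prob.mean()` up to floating-point rounding because the probabilities sum to 1.

```python
import pandas as pd


def above_uniform(values: pd.Series) -> pd.Series:
    """Return categories whose empirical probability exceeds the uniform share."""
    # Empirical probability of each observed category (missing values ignored)
    prob = values.value_counts(normalize=True)
    # Uniform share: 1 divided by the number of observed categories
    return prob[prob > 1 / len(prob)]


# Hypothetical categories, for illustration only
toy = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2)
result = above_uniform(toy)
print(result)
```

With the notebook's data this would be called as `above_uniform(df["CRASH HOUR"])` or `above_uniform(df["BOROUGH"])`.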