Adip Rai - Fab Futures - Data Science
Home About

DATA TRANSFORMATION

In data science, transformation means changing data from its original form into a new form that is more suitable for analysis, modeling, or interpretation.

🔹 Step 1: Import libraries¶

In [2]:
import pandas as pd
import numpy as np

🔹 Step 2: Load the dataset¶

In [19]:
df = pd.read_csv("datasets/MotorVehicle_CrashRecord.csv")
df.columns = df.columns.str.strip()
print(df.columns)
# View basic info
df.head(201)
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'LATITUDE', 'LONGITUDE',
       'ON STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE',
       'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
      dtype='object')
Out[19]:
CRASH DATE CRASH TIME BOROUGH LATITUDE LONGITUDE ON STREET NAME NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED CONTRIBUTING FACTOR VEHICLE VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 9/11/2021 2:39 NaN NaN NaN WHITESTONE EXPRESSWAY 2 0 0 0 0 0 2 0 Aggressive Driving/Road Rage Sedan Sedan NaN NaN NaN
1 3/26/2022 11:45 NaN NaN NaN QUEENSBORO BRIDGE UPPER 1 0 0 0 0 0 1 0 Pavement Slippery Sedan NaN NaN NaN NaN
2 11/1/2023 1:29 BROOKLYN 40.621790 -73.970024 OCEAN PARKWAY 1 0 0 0 0 0 1 0 Unspecified Moped Sedan Sedan NaN NaN
3 6/29/2022 6:55 NaN NaN NaN THROGS NECK BRIDGE 0 0 0 0 0 0 0 0 Following Too Closely Sedan Pick-up Truck NaN NaN NaN
4 9/21/2022 13:21 NaN NaN NaN BROOKLYN BRIDGE 0 0 0 0 0 0 0 0 Passing Too Closely Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
195 3/23/2022 21:00 QUEENS 40.776764 -73.848015 NaN 0 0 0 0 0 0 0 0 Unspecified Sedan NaN NaN NaN NaN
196 3/26/2022 12:00 QUEENS 40.747498 -73.913090 NaN 0 0 0 0 0 0 0 0 Unspecified Sedan NaN NaN NaN NaN
197 3/26/2022 13:28 BROOKLYN 40.701637 -73.942276 GRAHAM AVENUE 1 0 0 0 0 0 1 0 Following Too Closely Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle NaN NaN NaN
198 3/26/2022 19:12 QUEENS 40.693490 -73.826546 NaN 0 0 0 0 0 0 0 0 Unsafe Speed Station Wagon/Sport Utility Vehicle Sedan NaN NaN NaN
199 3/26/2022 21:29 QUEENS 40.726463 -73.859430 WETHEROLE STREET 1 0 1 0 0 0 0 0 Unsafe Speed Bike NaN NaN NaN NaN

200 rows × 20 columns

🔹 Step 3: Inspect data types & missing values¶

In [20]:
df.info()
df.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 20 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   CRASH DATE                     200 non-null    object 
 1   CRASH TIME                     200 non-null    object 
 2   BOROUGH                        124 non-null    object 
 3   LATITUDE                       169 non-null    float64
 4   LONGITUDE                      169 non-null    float64
 5   ON STREET NAME                 151 non-null    object 
 6   NUMBER OF PERSONS INJURED      200 non-null    int64  
 7   NUMBER OF PERSONS KILLED       200 non-null    int64  
 8   NUMBER OF PEDESTRIANS INJURED  200 non-null    int64  
 9   NUMBER OF PEDESTRIANS KILLED   200 non-null    int64  
 10  NUMBER OF CYCLIST INJURED      200 non-null    int64  
 11  NUMBER OF CYCLIST KILLED       200 non-null    int64  
 12  NUMBER OF MOTORIST INJURED     200 non-null    int64  
 13  NUMBER OF MOTORIST KILLED      200 non-null    int64  
 14  CONTRIBUTING FACTOR VEHICLE    198 non-null    object 
 15  VEHICLE TYPE CODE 1            196 non-null    object 
 16  VEHICLE TYPE CODE 2            123 non-null    object 
 17  VEHICLE TYPE CODE 3            11 non-null     object 
 18  VEHICLE TYPE CODE 4            3 non-null      object 
 19  VEHICLE TYPE CODE 5            0 non-null      float64
dtypes: float64(3), int64(8), object(9)
memory usage: 31.4+ KB
Out[20]:
CRASH DATE                         0
CRASH TIME                         0
BOROUGH                           76
LATITUDE                          31
LONGITUDE                         31
ON STREET NAME                    49
NUMBER OF PERSONS INJURED          0
NUMBER OF PERSONS KILLED           0
NUMBER OF PEDESTRIANS INJURED      0
NUMBER OF PEDESTRIANS KILLED       0
NUMBER OF CYCLIST INJURED          0
NUMBER OF CYCLIST KILLED           0
NUMBER OF MOTORIST INJURED         0
NUMBER OF MOTORIST KILLED          0
CONTRIBUTING FACTOR VEHICLE        2
VEHICLE TYPE CODE 1                4
VEHICLE TYPE CODE 2               77
VEHICLE TYPE CODE 3              189
VEHICLE TYPE CODE 4              197
VEHICLE TYPE CODE 5              200
dtype: int64

🔹 Step 4: DATA TRANSFORMATION STEPS¶

4.1 Convert Date column (if available)¶

In [26]:
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'], errors='coerce')

# Extract useful features
df['Year'] = df['CRASH DATE'].dt.year
df['Month'] = df['CRASH DATE'].dt.month
df['Day'] = df['CRASH DATE'].dt.day
df['Hour'] = df['CRASH DATE'].dt.hour

4.2 Transform Hour column (important for density analysis)¶

In [27]:
df['Hour'] = pd.to_numeric(df['Hour'], errors='coerce')
df = df.dropna(subset=['Hour'])

4.3 Create Time-of-Day feature (Feature Engineering)¶

In [28]:
def time_of_day(hour):
    if 0 <= hour < 6:
        return 'Late Night'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

df['Time_of_Day'] = df['Hour'].apply(time_of_day)

4.4 Encode categorical variables (Borough / Location)¶

In [38]:
df = pd.get_dummies(df, columns=['Borough'], drop_first=True)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[38], line 1
----> 1 df = pd.get_dummies(df, columns=['Borough'], drop_first=True)

File /opt/conda/lib/python3.13/site-packages/pandas/core/reshape/encoding.py:170, in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    168     raise TypeError("Input must be a list-like for parameter `columns`")
    169 else:
--> 170     data_to_encode = data[columns]
    172 # validate prefixes and separator to avoid silently dropping cols
    173 def check_len(item, name: str):

File /opt/conda/lib/python3.13/site-packages/pandas/core/frame.py:4119, in DataFrame.__getitem__(self, key)
   4117     if is_iterator(key):
   4118         key = list(key)
-> 4119     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4121 # take() does not accept boolean indexers
   4122 if getattr(indexer, "dtype", None) == bool:

File /opt/conda/lib/python3.13/site-packages/pandas/core/indexes/base.py:6212, in Index._get_indexer_strict(self, key, axis_name)
   6209 else:
   6210     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6212 self._raise_if_missing(keyarr, indexer, axis_name)
   6214 keyarr = self.take(indexer)
   6215 if isinstance(key, Index):
   6216     # GH 42790 - Preserve name from an Index

File /opt/conda/lib/python3.13/site-packages/pandas/core/indexes/base.py:6261, in Index._raise_if_missing(self, key, indexer, axis_name)
   6259 if nmissing:
   6260     if nmissing == len(indexer):
-> 6261         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6263     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6264     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['Borough'], dtype='object')] are in the [columns]"

4.5 Normalize numerical columns (Scaling)¶

In [31]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df[['Hour']] = scaler.fit_transform(df[['Hour']])

4.6 Log transformation (optional but powerful)¶

In [32]:
df['Log_Hour'] = np.log1p(df['Hour'])

🔹 Step 5: Save transformed dataset¶

In [34]:
df.to_csv("datasets/MotorVehicle_CrashRecord_Transformed.csv", index=False)
No description has been provided for this image
In [40]:
df.head(201)
Out[40]:
CRASH DATE CRASH TIME LATITUDE LONGITUDE ON STREET NAME NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED ... Year Month Day Hour Time_of_Day BOROUGH_BROOKLYN BOROUGH_MANHATTAN BOROUGH_QUEENS BOROUGH_STATEN ISLAND Log_Hour
0 2021-09-11 2:39 NaN NaN WHITESTONE EXPRESSWAY 2 0 0 0 0 ... 2021 9 11 0.0 Late Night False False False False 0.0
1 2022-03-26 11:45 NaN NaN QUEENSBORO BRIDGE UPPER 1 0 0 0 0 ... 2022 3 26 0.0 Late Night False False False False 0.0
2 2023-11-01 1:29 40.621790 -73.970024 OCEAN PARKWAY 1 0 0 0 0 ... 2023 11 1 0.0 Late Night True False False False 0.0
3 2022-06-29 6:55 NaN NaN THROGS NECK BRIDGE 0 0 0 0 0 ... 2022 6 29 0.0 Late Night False False False False 0.0
4 2022-09-21 13:21 NaN NaN BROOKLYN BRIDGE 0 0 0 0 0 ... 2022 9 21 0.0 Late Night False False False False 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
195 2022-03-23 21:00 40.776764 -73.848015 NaN 0 0 0 0 0 ... 2022 3 23 0.0 Late Night False False True False 0.0
196 2022-03-26 12:00 40.747498 -73.913090 NaN 0 0 0 0 0 ... 2022 3 26 0.0 Late Night False False True False 0.0
197 2022-03-26 13:28 40.701637 -73.942276 GRAHAM AVENUE 1 0 0 0 0 ... 2022 3 26 0.0 Late Night True False False False 0.0
198 2022-03-26 19:12 40.693490 -73.826546 NaN 0 0 0 0 0 ... 2022 3 26 0.0 Late Night False False True False 0.0
199 2022-03-26 21:29 40.726463 -73.859430 WETHEROLE STREET 1 0 1 0 0 ... 2022 3 26 0.0 Late Night False False True False 0.0

200 rows × 29 columns

In [ ]: