Aristarco - Fab Futures - Data Science

Transforms¶

Data transformation¶

Data transformation in data science is the crucial process of converting raw, messy data from various sources into a clean, standardized, and usable format, making it ready for analysis, modeling (such as machine learning), reporting, and storage, while ensuring accuracy, consistency, and an optimal structure for insights. It involves steps like cleaning errors, standardizing formats, handling missing values, aggregating data, and changing scales (e.g., normalization, log transforms) to meet specific analytical goals.

Key Aspects of Data Transformation:¶

  • Data Cleaning: Fixing errors, removing duplicates, handling outliers, and correcting inconsistencies.
  • Format Conversion: Changing data from one type to another (e.g., CSV to JSON, text to database tables).
  • Structuring: Combining columns (e.g., first/last name to full name) or restructuring data for different systems.
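
As a quick illustration of the cleaning, conversion, and structuring steps above, here is a minimal pandas sketch on a small made-up customer table (the column names and values are purely illustrative):

import pandas as pd

# Hypothetical raw customer records with a duplicate row, a missing value,
# and numbers stored as text
raw = pd.DataFrame({
    "first_name": ["Ana", "Ana", "Luis", "Eva"],
    "last_name":  ["Diaz", "Diaz", "Perez", "Gomez"],
    "amount":     ["100", "100", "250", None],
})

clean = (
    raw.drop_duplicates()                                                   # cleaning: remove exact duplicates
       .assign(amount=lambda d: pd.to_numeric(d["amount"]))                 # format conversion: text -> numeric
       .assign(amount=lambda d: d["amount"].fillna(d["amount"].median()))   # handle missing values
       .assign(full_name=lambda d: d["first_name"] + " " + d["last_name"])  # structuring: combine columns
)

print(clean)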

Value Transformation:

  • Normalization/Standardization: Scaling numerical data to a specific range (e.g., 0-1) or mean/std dev.
  • Encoding: Converting categorical data (like "Male"/"Female") into numerical values (0/1) for algorithms.
  • Aggregation: Summarizing data (e.g., summing sales by month).
  • Feature Engineering: Creating new, more informative features from existing data, essential for ML.
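
A short sketch of these value transformations, again on a made-up sales table (column names are illustrative); it uses scikit-learn's MinMaxScaler for the 0-1 scaling and plain pandas for the encoding and aggregation:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sales records
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "channel": ["online", "store", "online", "store"],
    "revenue": [120.0, 300.0, 90.0, 410.0],
})

# Normalization: rescale revenue to the 0-1 range
sales["revenue_scaled"] = MinMaxScaler().fit_transform(sales[["revenue"]]).ravel()

# Encoding: turn the categorical 'channel' column into 0/1 indicator columns
encoded = pd.get_dummies(sales["channel"], prefix="channel")

# Aggregation: total revenue per month (also a simple engineered feature)
monthly = sales.groupby("month", sort=False)["revenue"].sum()

print(sales, encoded, monthly, sep="\n\n")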

Why It's Important:

  • Improves Accuracy: Ensures reliable analysis and decisions.
  • Enhances Usability: Makes data accessible and understandable for different tools and users.
  • Supports Algorithms: Many ML models require numerical, clean, and properly scaled input.
  • Data Integration: Unifies data from disparate sources into a consistent format.

In essence, transformation turns chaotic raw data into high-quality, actionable assets, bridging the gap between messy real-world data and valuable business insights.

Principal Component Analysis¶

PCA, or Principal Component Analysis, is a data science technique for dimensionality reduction that transforms high-dimensional data into a smaller set of uncorrelated variables called principal components. This simplifies datasets by compressing most of the original information into fewer features, which can improve model performance, reduce noise, and make complex data easier to visualize and analyze.

How PCA is used in data science

Dimensionality Reduction: PCA reduces the number of features in a dataset while retaining the most important information, which can significantly speed up the training time for machine learning models.

Noise Reduction: By focusing on components with the highest variance, PCA can filter out random noise in the data, leading to more robust models.

Data Visualization: PCA can project high-dimensional data onto a two or three-dimensional space, making it possible to visualize complex relationships, patterns, and clusters that would otherwise be invisible.

Feature Engineering: It creates a new, smaller set of features (the principal components) that can be used as input for other machine learning algorithms.

Exploratory Data Analysis: It helps in understanding the structure of the data by identifying the main directions of variance, which can reveal underlying patterns.

Key benefits¶

Improved model performance: By reducing the number of features and filtering noise, PCA can improve the accuracy and efficiency of machine learning models.

Computational efficiency: Processing fewer dimensions leads to faster training and inference times.

Enhanced interpretability: Visualizing data in a lower-dimensional space provides more intuitive insights than working with hundreds of variables.

Overcoming the Curse of Dimensionality: It helps mitigate the negative impacts of having too many features on model performance.

Source: IBM, "What is Principal Component Analysis (PCA)?"

Data standardization¶

Data standardization is the process of converting data from various sources into a consistent, uniform format, ensuring it follows predefined rules for structure, definition, and values, and making it comparable, accessible, and reliable for analysis, integration, and use in applications like machine learning. It addresses inconsistencies (e.g., "St." vs. "Street") and ensures features with different scales don't unfairly influence models.

Key Aspects of Data Standardization

  • Consistency: Creates uniform formats, labels, and definitions across datasets.
  • Integration: Enables data from different systems to work together seamlessly.
  • Analysis: Improves data quality, accuracy, and reliability for reporting and insights.
  • Machine Learning: Often involves transforming data so it has a mean of 0 and standard deviation of 1 (Z-score) to prevent high-value features from dominating, especially for algorithms like SVM, KNN, or PCA.

How it Works (Examples)

  • Address Data: Changing "Main St." to "Main Street" or "Maint St" to "Main Street".
  • Date Formats: Converting various date entries (e.g., 1/1/2024, Jan 1, 2024) into a single format like YYYY-MM-DD.
  • Categorical Data: Mapping "USA," "U.S.," and "United States" to a single "United States".
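
A minimal pandas sketch of these mapping rules; the replacement dictionaries and the typo fix are assumptions chosen to mirror the examples above, not a general-purpose cleaner:

import pandas as pd

# Hypothetical messy records mirroring the examples above
records = pd.DataFrame({
    "address": ["123 Main St.", "123 Maint St", "45 Oak Street"],
    "date":    ["1/1/2024", "Jan 1, 2024", "2024-01-01"],
    "country": ["USA", "U.S.", "United States"],
})

# Address data: fix the known typo first, then expand the abbreviation
records["address"] = (records["address"]
                      .str.replace("Maint St", "Main Street", regex=False)
                      .str.replace("St.", "Street", regex=False))

# Date formats: parse each entry and re-emit it as YYYY-MM-DD
records["date"] = [pd.to_datetime(d).strftime("%Y-%m-%d") for d in records["date"]]

# Categorical data: map every spelling variant to one canonical label
records["country"] = records["country"].replace({"USA": "United States",
                                                 "U.S.": "United States"})

print(records)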

Why It's Important

  • Better Decisions: High-quality, consistent data leads to more reliable business intelligence.
  • Efficiency: Reduces errors, redundancies, and operational costs.
  • Compliance: Helps meet regulatory requirements by ensuring data accuracy.
  • System Compatibility: Allows different software and platforms to understand and use the data.

Example¶

Gemini Prompt: Act as a data science professor. Select a dataset to learn data standardization and PCA

Hello, and welcome to class! I am delighted to guide you through two fundamental pillars of machine learning preprocessing: Data Standardization and Principal Component Analysis (PCA).

To truly understand these concepts, we need a dataset that is "messy" enough to require standardization, but structured enough to make the results of PCA obvious.

The Selected Dataset: The Wine Dataset

For this lesson, we will use the classic UCI Wine Dataset.

Why this dataset? It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars (varieties).

  • Rows (Samples): 178 individual wines.
  • Columns (Features): 13 continuous chemical attributes (e.g., Alcohol, Malic Acid, Magnesium, Color Intensity).

This is the perfect candidate because the features have vastly different scales.

Phase 1: The Problem of Scale

Imagine we feed this raw data directly into a machine learning model. Let's look at two specific features from the dataset to see the problem:

  • Nonflavanoid Phenols: typical value range 0.13 - 0.66 mg/L (very small)
  • Proline: typical value range 278 - 1680 mg/L (very large)

The Trap: PCA works by finding the directions (Principal Components) where the data varies the most. If we don't fix the scales, PCA will look at Proline (variance in the hundreds) and think it is infinitely more important than Phenols (variance < 1).

The algorithm isn't "smart" enough to know that small changes in Phenols might be chemically significant. It just sees big numbers vs. small numbers.

Phase 2: Data Standardization (Z-Score)

To fix this, we force all features to speak the same language. We apply Standardization (often called Z-score normalization) to every column.

The Formula: z = (x - μ) / σ

Where x is the original value, μ is the mean of that feature, and σ is its standard deviation.

The Result: After this step, every chemical feature (Alcohol, Magnesium, Proline) will have a mean of 0 and a standard deviation of 1. Now, a value of 2.0 in Proline means "two standard deviations above average," and a value of 2.0 in Phenols means the exact same thing relative to its own distribution. They are now comparable.
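
A tiny NumPy sketch of that formula, applied by hand to some arbitrary Proline-like values (the numbers are made up for illustration):

import numpy as np

# Arbitrary "Proline-like" values on a large scale
x = np.array([278.0, 520.0, 760.0, 1050.0, 1680.0])

# Z-score: subtract the feature's mean, divide by its standard deviation
z = (x - x.mean()) / x.std()

print(z.round(2))                             # each value expressed in standard deviations
print(round(z.mean(), 2), round(z.std(), 2))  # effectively 0 and exactly 1

This is the same arithmetic that StandardScaler performs column by column in the code below.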

Phase 3: Principal Component Analysis (PCA)

Now that our data is standardized, we are ready for PCA.

What PCA does: It takes our 13-dimensional problem (which is impossible to visualize) and rotates it to find the "axes" that capture the most information.

  • PC1 (First Principal Component): The direction where the data is most spread out.
  • PC2 (Second Principal Component): The direction orthogonal (perpendicular) to PC1 that captures the second-most variance.

The Outcome on the Wine Data: Typically, just the first 3 Principal Components capture over 65% of the total information contained in the original 13 features. Instead of tracking 13 chemicals, we can plot the wines on a 2D graph using PC1 and PC2.

You will likely see three distinct clusters appear—these correspond to the three specific wine cultivars! This reveals the hidden structure of the data that was previously buried in the spreadsheet.

Summary of the Workflow

  • Load Data: Import the 13 features of the Wine dataset.
  • Standardize: Apply StandardScaler so Proline doesn't dominate.
  • Apply PCA: Compress the 13 dimensions down to 2 or 3 principal components.
  • Visualize: Plot the new components to see the separation of wine varieties.

Next Step

To solidify this knowledge, we should get our hands dirty with code.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- STEP 1: LOAD THE DATA ---
# We load the dataset and convert it to a DataFrame for easier viewing
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target # These are the wine categories (0, 1, 2)

print("### PROFESSOR'S NOTE: RAW DATA STATISTICS ###")
print("Notice the massive difference in scale between 'proline' and 'nonflavanoid_phenols':")
print(df[['proline', 'nonflavanoid_phenols']].describe().loc[['mean', 'std']])
print("-" * 50)


# --- STEP 2: STANDARDIZATION ---
# This brings all features to Mean=0, Std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Let's verify the standardization worked
df_scaled = pd.DataFrame(X_scaled, columns=df.columns)
print("\n### PROFESSOR'S NOTE: STANDARDIZED DATA ###")
print("Notice how the means are now effectively 0 and std is 1:")
print(df_scaled[['proline', 'nonflavanoid_phenols']].describe().loc[['mean', 'std']].round(2))
print("-" * 50)


# --- STEP 3: APPLY PCA ---
# We compress the 13 columns down to just 2 Principal Components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Calculate how much information (variance) we preserved
explained_variance = pca.explained_variance_ratio_
total_variance = sum(explained_variance) * 100

print(f"\n### PROFESSOR'S NOTE: PCA RESULTS ###")
print(f"PC1 explains: {explained_variance[0]:.2%} of the variance")
print(f"PC2 explains: {explained_variance[1]:.2%} of the variance")
print(f"Together, we preserved {total_variance:.2f}% of the information using only 2 dimensions!")


# --- STEP 4: VISUALIZATION ---
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=X_pca[:, 0], 
    y=X_pca[:, 1], 
    hue=data.target_names[y], # Color by wine variety
    palette='viridis',
    s=100
)

plt.title(f'PCA of Wine Dataset (Preserved Variance: {total_variance:.2f}%)', fontsize=15)
plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})', fontsize=12)
plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})', fontsize=12)
plt.axhline(0, color='grey', linestyle='--', linewidth=0.8)
plt.axvline(0, color='grey', linestyle='--', linewidth=0.8)
plt.legend(title='Wine Cultivar')
plt.grid(True, alpha=0.3)

plt.show()
### PROFESSOR'S NOTE: RAW DATA STATISTICS ###
Notice the massive difference in scale between 'proline' and 'nonflavanoid_phenols':
         proline  nonflavanoid_phenols
mean  746.893258              0.361854
std   314.907474              0.124453
--------------------------------------------------

### PROFESSOR'S NOTE: STANDARDIZED DATA ###
Notice how the means are now effectively 0 and std is 1:
      proline  nonflavanoid_phenols
mean     -0.0                  -0.0
std       1.0                   1.0
--------------------------------------------------

### PROFESSOR'S NOTE: PCA RESULTS ###
PC1 explains: 36.20% of the variance
PC2 explains: 19.21% of the variance
Together, we preserved 55.41% of the information using only 2 dimensions!
[Figure: PCA scatter plot of the Wine dataset (PC1 vs. PC2), colored by wine cultivar]

What to look for when you run this:

The Console Output: Look closely at the "Raw Data" printout. You will see proline has a mean around 746, while nonflavanoid_phenols has a mean of 0.36. Without standardization, Proline would dominate the calculation completely.

The Plot: You will see three distinct clusters of dots. Each color represents a different wine cultivar.

Notice how well separated they are? This proves that even though we threw away 11 dimensions of data, we kept the essential structure that distinguishes one wine from another.

Prompt: Explain the meaning of the chart, what PC1 and PC2 are, and what the loading scores are and how to interpret them¶

1. The Map: The Scatter Plot

  • The chart you just generated is essentially a similarity map.
  • The Points: Each dot is one specific bottle of wine.
  • The Distance: In this new 2D space, the distance between points represents chemical similarity.

If two dots are right next to each other, those two wines have a nearly identical chemical profile. If they are on opposite sides of the graph, they are chemically very different.

  • The Axes (PC1 and PC2): These are the new "directions" or "rulers" we created to measure the wines. We are no longer measuring them by "Alcohol" or "Ash"; we are measuring them by "Principal Component 1" and "Principal Component 2."

2. The Components: What are PC1 and PC2?

Students often ask: "Is PC1 just the Alcohol column?" No. Think of PC1 and PC2 as Super-Variables or Blends. Imagine you are making a smoothie. The original features (Alcohol, Malic Acid, Ash) are the raw ingredients (Banana, Strawberry, Kale).

PC1 is a specific smoothie recipe: It might be 50% Banana, 30% Strawberry, and -20% Kale.

PC2 is a completely different recipe: It might be 10% Banana, -40% Strawberry, and 80% Kale.

Mathematically, they are Linear Combinations of your original standardized features:

PC1 = (w_1 × Alcohol) + (w_2 × Malic Acid) + ... + (w_13 × Proline)

PC1 is the blend that captures the maximum variance (the widest spread of the data). PC2 is the blend that captures the second-most variance, but it must be uncorrelated (perpendicular) to PC1.
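
A small sketch to verify that claim on the Wine data: it refits the same StandardScaler + PCA pipeline used in the cell above and checks that PC1 really is the weighted sum of the 13 standardized features:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Rebuild the standardized Wine data and fit PCA, as in the earlier cell
X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# PC1 for every wine is the weighted sum of its 13 standardized features,
# using the weights w_1 ... w_13 stored in pca.components_[0]
pc1_by_hand = X_scaled @ pca.components_[0]

print(np.allclose(pc1_by_hand, X_pca[:, 0]))  # True: the "blend" matches PCA's output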

3. The Recipe: Loading Scores

How do we know the "recipe" for these blends? We look at the Loading Scores.

A Loading Score tells you how much a specific original feature contributes to a Principal Component. For standardized data it is proportional to the correlation between that feature and the component (scaling each raw weight in pca.components_ by the square root of the component's variance gives the exact correlation).

How to Interpret the Scores:

High Positive Value (e.g., +0.9): Strong positive influence. When this feature increases, the PC value increases.

Example: If "Flavanoids" has a loading of +0.9 on PC1, then wines with high PC1 values usually have very high Flavanoids.

High Negative Value (e.g., -0.8): Strong negative influence. When this feature increases, the PC value decreases.

Example: If "Alcalinity" has a loading of -0.8 on PC1, then wines with high PC1 values have low Alcalinity.

Near Zero (e.g., 0.05): No influence. This feature is just noise to this specific component.

Let's Reveal the Recipe (The Code)¶

Let's modify our script slightly to print out these "recipes" so we can see exactly what chemically defines our wine clusters.

In [2]:
# Here is the Python script to perform the analysis we discussed. I have annotated the code heavily so you can follow the logic as if I were standing right there at the whiteboard.

# This script uses scikit-learn (the industry standard for machine learning in Python) along with pandas for data handling and matplotlib for visualization.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- STEP 1: LOAD THE DATA ---
# We load the dataset and convert it to a DataFrame for easier viewing
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target # These are the wine categories (0, 1, 2)

print("### PROFESSOR'S NOTE: RAW DATA STATISTICS ###")
print("Notice the massive difference in scale between 'proline' and 'nonflavanoid_phenols':")
print(df[['proline', 'nonflavanoid_phenols']].describe().loc[['mean', 'std']])
print("-" * 50)


# --- STEP 2: STANDARDIZATION ---
# This brings all features to Mean=0, Std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Let's verify the standardization worked
df_scaled = pd.DataFrame(X_scaled, columns=df.columns)
print("\n### PROFESSOR'S NOTE: STANDARDIZED DATA ###")
print("Notice how the means are now effectively 0 and std is 1:")
print(df_scaled[['proline', 'nonflavanoid_phenols']].describe().loc[['mean', 'std']].round(2))
print("-" * 50)


# --- STEP 3: APPLY PCA ---
# We compress the 13 columns down to just 2 Principal Components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Calculate how much information (variance) we preserved
explained_variance = pca.explained_variance_ratio_
total_variance = sum(explained_variance) * 100

print(f"\n### PROFESSOR'S NOTE: PCA RESULTS ###")
print(f"PC1 explains: {explained_variance[0]:.2%} of the variance")
print(f"PC2 explains: {explained_variance[1]:.2%} of the variance")
print(f"Together, we preserved {total_variance:.2f}% of the information using only 2 dimensions!")


# --- STEP 4: VISUALIZATION ---
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=X_pca[:, 0], 
    y=X_pca[:, 1], 
    hue=data.target_names[y], # Color by wine variety
    palette='viridis',
    s=100
)

plt.title(f'PCA of Wine Dataset (Preserved Variance: {total_variance:.2f}%)', fontsize=15)
plt.xlabel(f'Principal Component 1 ({explained_variance[0]:.2%})', fontsize=12)
plt.ylabel(f'Principal Component 2 ({explained_variance[1]:.2%})', fontsize=12)
plt.axhline(0, color='grey', linestyle='--', linewidth=0.8)
plt.axvline(0, color='grey', linestyle='--', linewidth=0.8)
plt.legend(title='Wine Cultivar')
plt.grid(True, alpha=0.3)

plt.show()

# Create a DataFrame for the "Loadings" (The Recipe)
loadings = pd.DataFrame(
    pca.components_.T,  # Transpose the matrix
    columns=['PC1', 'PC2'], 
    index=df.columns
)

print("### PROFESSOR'S ANALYSIS: The 'Recipe' for PC1 ###")
# We sort by PC1 to see the strongest drivers
print(loadings['PC1'].sort_values(ascending=False))
### PROFESSOR'S NOTE: RAW DATA STATISTICS ###
Notice the massive difference in scale between 'proline' and 'nonflavanoid_phenols':
         proline  nonflavanoid_phenols
mean  746.893258              0.361854
std   314.907474              0.124453
--------------------------------------------------

### PROFESSOR'S NOTE: STANDARDIZED DATA ###
Notice how the means are now effectively 0 and std is 1:
      proline  nonflavanoid_phenols
mean     -0.0                  -0.0
std       1.0                   1.0
--------------------------------------------------

### PROFESSOR'S NOTE: PCA RESULTS ###
PC1 explains: 36.20% of the variance
PC2 explains: 19.21% of the variance
Together, we preserved 55.41% of the information using only 2 dimensions!
[Figure: PCA scatter plot of the Wine dataset (PC1 vs. PC2), colored by wine cultivar]
### PROFESSOR'S ANALYSIS: The 'Recipe' for PC1 ###
flavanoids                      0.422934
total_phenols                   0.394661
od280/od315_of_diluted_wines    0.376167
proanthocyanins                 0.313429
hue                             0.296715
proline                         0.286752
alcohol                         0.144329
magnesium                       0.141992
ash                            -0.002051
color_intensity                -0.088617
alcalinity_of_ash              -0.239320
malic_acid                     -0.245188
nonflavanoid_phenols           -0.298533
Name: PC1, dtype: float64

PC1 is usually dominated by Flavanoids, Phenols, and OD280/OD315. This suggests PC1 separates wines based on their complexity and maturity.

PC2 might be dominated by Color Intensity and Alcohol. This suggests PC2 separates wines based on their boldness or visual appearance.

Summary

When you look at your chart:

Moving Left-to-Right (along PC1) means you are changing the Phenol/Flavanoid content.

Moving Up-and-Down (along PC2) means you are changing the Color intensity.

Next Step¶

We used 2 components because they are easy to plot on a 2D screen. But we started with 13 dimensions. How do we know if 2 is enough? Maybe we need 3? Or 5?

Prompt explain the "Scree Plot" and the "Cumulative Variance" rule to scientifically determine the perfect number of components to keep¶

This is the "Goldilocks Problem" of Data Science: How much is too much, and how little is too little?

We want to reduce dimensions to simplify the data, but we don't want to lose important information. To make this decision scientifically, we use two specific tools.

Tool 1: The Scree Plot (The "Elbow" Method)

Imagine you are squeezing juice from oranges.

  • PC1 (First Squeeze): You get a huge glass of juice.

  • PC2 (Second Squeeze): You get half a glass.

  • PC3 (Third Squeeze): You get a few drops.

  • PC13 (Last Squeeze): You are just breaking the rind; there is nothing left.

A Scree Plot visualizes this "juice" (variance) for each component. We look for the "Elbow"—the point where the curve bends sharply and flattens out.

Before the Elbow: Each new component adds significant information.

After the Elbow: You are essentially just adding noise.

Tool 2: Cumulative Variance (The "Threshold" Method)

This is a strictly numerical rule. You decide in advance how much information you need to keep (e.g., "I need to preserve 90% of the original data's structure").

You then stack the components on top of each other until you hit that number.

PC1: 36%

PC1 + PC2: 55%

PC1 + PC2 + PC3: 66%

...

PC1 ... PC8: ~92% (Stop here! We have crossed the 90% threshold.)

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load and Standardize (Same as before)
data = load_wine()
X = data.data
X_scaled = StandardScaler().fit_transform(X)

# 2. Fit PCA with ALL components instead of just 2
pca_all = PCA()
pca_all.fit(X_scaled)

# 3. Extract Variance Data
variance_ratios = pca_all.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_ratios)

# 4. Plotting
plt.figure(figsize=(12, 6))

# Plot 1: The Scree Plot (Individual Variance)
plt.bar(
    range(1, 14), 
    variance_ratios, 
    alpha=0.7, 
    label='Individual Variance (Scree)',
    color='#4c72b0'
)

# Plot 2: Cumulative Variance (Step Plot)
plt.step(
    range(1, 14), 
    cumulative_variance, 
    where='mid', 
    label='Cumulative Variance',
    color='#c44e52',
    linewidth=2
)

# Visual Guide Lines
plt.axhline(y=0.90, color='grey', linestyle='--', label='90% Threshold')
plt.axhline(y=0.80, color='grey', linestyle=':', label='80% Threshold')

plt.title('Scree Plot & Cumulative Variance (Wine Dataset)', fontsize=16)
plt.xlabel('Principal Component Index', fontsize=12)
plt.ylabel('Explained Variance Ratio', fontsize=12)
plt.xticks(range(1, 14))
plt.legend(loc='center right')
plt.grid(True, alpha=0.3)

plt.show()

# Print exact numbers for decision making
print("### PROFESSOR'S ANALYSIS ###")
for i, (var, cum_var) in enumerate(zip(variance_ratios, cumulative_variance)):
    print(f"PC{i+1}: Explains {var:.2%} | Cumulative: {cum_var:.2%}")
[Figure: Scree plot (bars of individual explained variance) with the cumulative variance step line and the 80%/90% threshold lines]
### PROFESSOR'S ANALYSIS ###
PC1: Explains 36.20% | Cumulative: 36.20%
PC2: Explains 19.21% | Cumulative: 55.41%
PC3: Explains 11.12% | Cumulative: 66.53%
PC4: Explains 7.07% | Cumulative: 73.60%
PC5: Explains 6.56% | Cumulative: 80.16%
PC6: Explains 4.94% | Cumulative: 85.10%
PC7: Explains 4.24% | Cumulative: 89.34%
PC8: Explains 2.68% | Cumulative: 92.02%
PC9: Explains 2.22% | Cumulative: 94.24%
PC10: Explains 1.93% | Cumulative: 96.17%
PC11: Explains 1.74% | Cumulative: 97.91%
PC12: Explains 1.30% | Cumulative: 99.20%
PC13: Explains 0.80% | Cumulative: 100.00%

Interpretation of the Output¶

When you run this code, look at the output table and chart:

The Elbow: Look at the blue bars (Scree). You will likely see a steep drop after PC3. The bars for PC4 through PC13 become quite flat. This suggests the intrinsic dimensionality is likely around 3.

The Threshold: Look at the red line (Cumulative).

PC1 + PC2: ~55% (Good for 2D visualization, but we lost nearly half the data).

PC1 through PC3: ~66%.

PC1 through PC8: This is where we cross the 90% threshold (about 92% cumulative variance).

The Verdict:

If your goal is Visualization (making a chart for a slide deck), stop at 2 or 3 components. We accept the loss of information for the sake of plotting.

If your goal is Machine Learning (feeding this data into a classifier), you should likely keep the first 7 or 8 components to ensure the model has enough signal (90% variance) to learn effectively.
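
As a hedged shortcut worth knowing: scikit-learn's PCA can choose that number for you. Passing a fraction between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on the same Wine data:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the Wine features as before
X_scaled = StandardScaler().fit_transform(load_wine().data)

# A float in (0, 1) means "keep enough components to reach this variance fraction"
pca_90 = PCA(n_components=0.90)
X_reduced = pca_90.fit_transform(X_scaled)

print(f"Components kept: {pca_90.n_components_}")   # per the table above, this should be 8
print(f"Variance preserved: {pca_90.explained_variance_ratio_.sum():.2%}")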

Learning Points¶

  • Data transformation is the crucial process of converting raw, messy data from various sources into a clean, standardized, and usable format, making it ready for analysis.
  • PCA transforms high-dimensional data into a smaller set of uncorrelated variables called principal components.
  • Data standardization converts data from various sources into a consistent, uniform format that follows predefined rules for structure, definition, and values, making it comparable, accessible, and reliable for analysis.
  • The Elbow: the point where the scree plot flattens suggests the data's intrinsic dimensionality.
  • The Threshold: keep components until the cumulative explained variance reaches a preset target (e.g., 90%).
  • If your goal is Visualization (making a chart for a slide deck), stop at 2 or 3 components. We accept the loss of information for the sake of plotting.
  • If your goal is Machine Learning (feeding this data into a classifier), you should likely keep the first 7 or 8 components to ensure the model has enough signal (90% variance) to learn effectively.