Aristarco - Fab Futures - Data Science

Probability¶

Clarifying concepts¶

Entropy:¶

I found a very interesting article on LinkedIn, The Power of Entropy in Data Science: Insights and Applications by Yoav Avneon, PhD.

Summary

Entropy is often used as a measure of uncertainty in machine learning. In data science, it can be used to identify the most informative features in a dataset, helping to improve the accuracy of models.

Feature Selection Using Entropy: Feature selection is an essential step in many data science projects, and entropy-based methods can be used to identify the most informative features in a dataset. By selecting features with high information gain, data scientists can reduce the amount of data required for modeling, making the process faster and more efficient.
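As a minimal sketch of the idea (my own toy example, not code from the article), the snippet below computes the Shannon entropy of a label column and the information gain from splitting on one feature; the shannon_entropy and information_gain helpers and the tiny dataframe are invented for illustration.

import numpy as np
import pandas as pd

def shannon_entropy(series):
    # Shannon entropy (in bits) of a discrete variable
    p = series.value_counts(normalize=True)
    return -np.sum(p * np.log2(p))

def information_gain(data, feature, target):
    # Entropy of the target minus the weighted entropy after splitting on `feature`
    total = shannon_entropy(data[target])
    weighted = sum(
        (len(group) / len(data)) * shannon_entropy(group[target])
        for _, group in data.groupby(feature)
    )
    return total - weighted

# Toy data: does 'weather' tell us anything about 'played'?
toy = pd.DataFrame({
    "weather": ["sunny", "sunny", "rain", "rain", "rain", "sunny"],
    "played":  ["yes",   "yes",   "no",   "no",   "yes",  "yes"],
})
print(shannon_entropy(toy["played"]))              # uncertainty of the label (~0.92 bits)
print(information_gain(toy, "weather", "played"))  # how much 'weather' reduces it (~0.46 bits)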

Anomaly Detection Using Entropy: Anomaly detection is a critical task in data science, and entropy can be used to detect anomalies in data and improve anomaly detection algorithms. By identifying events with low probability or high entropy, data scientists can improve the accuracy of their models and identify potential sources of error.
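A hedged sketch of that idea (again my own illustration, not from the article): estimate an empirical probability for each observation from a histogram and flag values that land in very low-probability bins. The synthetic data and the 0.005 threshold are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
# Mostly "normal" readings plus a few injected outliers
values = np.concatenate([rng.normal(50, 5, 1000), [95.0, 2.0, 120.0]])

# Empirical probability of each value's histogram bin
counts, edges = np.histogram(values, bins=50)
probs = counts / counts.sum()
bin_idx = np.clip(np.digitize(values, edges) - 1, 0, len(probs) - 1)
value_prob = probs[bin_idx]

# Flag observations whose bin probability is below the (arbitrary) threshold
anomalies = values[value_prob < 0.005]
print(anomalies)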

Other uses for entropy include Network Analysis, Data Compression, Time Series Analysis, Information Retrieval, and Data Visualization.

One comment has a very interesting summary:

In data science, entropy is often used to quantify the amount of information in a dataset. Only some of the data that is held is informative.

To say that your data holds high entropy is to say that you can't grasp the overall meaning or idea behind the content. It's like sand that keeps slipping through your fingers.

On the other hand, low entropy in a dataset indicates that the data has a high degree of order or structure, which makes it easier to understand and analyze. Using entropy, data scientists can quantify the amount of order or disorder in a dataset and use that information to gain insights and make predictions. — Dany Saban

Bayes' theorem explained in video¶

Super Simple Explanation of Bayes Theorem!
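To keep the formula next to the video: Bayes' theorem says $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. A quick numeric sketch with made-up numbers (1% disease prevalence, a test with 95% sensitivity and 90% specificity) shows why a positive test is less conclusive than it feels:

# Made-up numbers for illustration only
p_disease = 0.01              # prior P(A)
p_pos_given_disease = 0.95    # sensitivity, P(B|A)
p_pos_given_healthy = 0.10    # false-positive rate, 1 - specificity

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B): probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.088, i.e. under 9% despite the positive result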

Diamonds Dataset¶

Looking for examples, I found a very common Data Science example that uses diamond shape, size, and value: the Diamonds dataset, which is built into the seaborn library. It is widely considered a "gold standard" for practicing distribution analysis because it is large (approximately 54,000 rows), clean, and contains continuous variables that follow interesting, non-normal distributions.

Why the Diamonds Dataset?¶

  • Volume: It has enough data points to create smooth, high-resolution histograms.
  • Skewness: The price column is heavily right-skewed (long tail), which is perfect for practicing fitting Exponential or Log-Normal probability density functions (PDFs).
  • Multimodality: The carat column shows peaks at specific weights (0.5, 1.0, 1.5), reflecting human bias in diamond cutting. This is excellent for learning why a simple Normal Distribution often fails in the real world.

So first we load the dataset and take a look at its data:

In [2]:
# Gemini Prompt: Act as a data scientist. Recomend me a good dataset to try nice hystograms and probability functions in python

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# 1. Load the dataset (built into Seaborn)
df = sns.load_dataset('diamonds')

df.head()
Out[2]:
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
In [5]:
# Also from Gemini: Python Implementation
# Here is a complete script using seaborn for plotting and scipy.stats to mathematically fit a probability function to the data.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Set up the plotting environment
sns.set_theme(style="whitegrid")
plt.figure(figsize=(14, 6))

# --- PLOT 1: The 'Price' Histogram with a Fitted Distribution ---
plt.subplot(1, 2, 1)

# A. Plot the actual data histogram
sns.histplot(df['price'], kde=False, stat="density", bins=50, color="skyblue", alpha=0.6, label="Actual Data")

# B. Fit a theoretical distribution (Exponential) to the data
# We use 'expon' because prices are often right-skewed and cannot be negative
loc, scale = stats.expon.fit(df['price']) 
x_range = np.linspace(df['price'].min(), df['price'].max(), 1000)
pdf_fitted = stats.expon.pdf(x_range, loc, scale)

# C. Plot the PDF line
plt.plot(x_range, pdf_fitted, 'r-', lw=3, label=f'Fitted Exponential PDF')
plt.title('Diamond Price: Right-Skewed Distribution')
plt.legend()

# --- PLOT 2: The 'Carat' Histogram (Multimodal/Peaks) ---
plt.subplot(1, 2, 2)
sns.histplot(df['carat'], kde=True, bins=60, color="purple", alpha=0.6)
plt.title('Carat Weight: Multimodal Distribution (Human Bias)')
plt.xlim(0, 3) # Limit x-axis to see the peaks clearly

plt.tight_layout()
plt.show()
[Figure: left, diamond price histogram with fitted Exponential PDF; right, carat weight histogram showing multimodal peaks]

Log Transformation.¶

Many machine learning algorithms (like Linear Regression) assume the data (or at least the model's errors) roughly follows a normal "bell curve." If I feed them the raw, skewed diamond prices, the model will struggle. By compressing the large values with a logarithm, we force the data into a shape the model can handle.

The Logic¶

Problem: The price column spans orders of magnitude (e.g., $300 to $18,000), so the right "tail" is very long.

Solution: We apply $\log(x)$ (or, more commonly, np.log1p(x), which is $\log(x+1)$, to avoid errors with zeros).

Result: A distribution that is much closer to a Normal (bell-shaped) curve and far easier for the model to handle.

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load data
df = sns.load_dataset('diamonds')

# Create a figure with two side-by-side plots
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# --- PLOT 1: Original Skewed Data ---
sns.histplot(df['price'], bins=50, kde=True, color="skyblue", ax=ax[0])
ax[0].set_title("Original Price Data\n(Highly Right-Skewed)")
ax[0].set_xlabel("Price ($)")

# --- PLOT 2: Log-Transformed Data ---
# Apply natural log to the price column
# np.log1p calculates log(1 + x) to handle any potential zeros safely
df['log_price'] = np.log1p(df['price'])

sns.histplot(df['log_price'], bins=50, kde=True, color="green", ax=ax[1])
ax[1].set_title("Log-Transformed Price\n(Approx. Normal Distribution)")
ax[1].set_xlabel("Log(Price)")

plt.tight_layout()
plt.show()
[Figure: left, original right-skewed price histogram; right, log-transformed price histogram]
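As a small follow-up (my own addition, not part of the Gemini script): the skewness statistic puts a number on how asymmetric each version is, and np.expm1 undoes np.log1p when model predictions need to go back to dollars.

import numpy as np
import seaborn as sns

df = sns.load_dataset('diamonds')
log_price = np.log1p(df['price'])

# Skewness near 0 means roughly symmetric; large positive values mean a long right tail
print("Skew of raw price:", round(df['price'].skew(), 2))
print("Skew of log price:", round(log_price.skew(), 2))

# np.expm1 is the exact inverse of np.log1p
print("Round trip:", np.expm1(log_price.iloc[0]), "vs original:", df['price'].iloc[0])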

Learning Points¶

  • Good Data: I looked for a good dataset to understand how a Gaussian bell curve shows up in real data.
  • Log Transformation: When your numbers span orders of magnitude, a log transformation is necessary; otherwise the largest values can "pull" the results. Think of mixing age and annual income in the same model.
  • Magic Numbers: In the diamond industry, "magic numbers" exist where cutters try to keep diamonds just above specific weights (like 0.5, 1.0, or 1.5 carats) because the price jumps significantly at those thresholds. A quick check of this claim appears below.
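A quick, hedged check of the "magic numbers" claim on the same dataset: counting diamonds in a narrow band just below versus just above 1.00 carat (the 0.05-carat band width is an arbitrary choice).

import seaborn as sns

df = sns.load_dataset('diamonds')

just_below = df[(df['carat'] >= 0.95) & (df['carat'] < 1.00)]
just_above = df[(df['carat'] >= 1.00) & (df['carat'] < 1.05)]

print("0.95-0.99 carat:", len(just_below), "diamonds, median price", just_below['price'].median())
print("1.00-1.04 carat:", len(just_above), "diamonds, median price", just_above['price'].median())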